Changes of Revision 27
x265.changes
Changed
@@ -1,35 +1,4 @@
 -------------------------------------------------------------------
-Thu Mar 1 23:14:47 UTC 2018 - zaitor@opensuse.org
-
-- Update to version 2.7:
-  * New features:
-    - option:`--gop-lookahead` can be used to extend the gop
-      boundary(set by `--keyint`). The GOP will be extended, if a
-      scene-cut frame is found within this many number of frames.
-    - Support for RADL pictures added in x265.
-    - option:`--radl` can be used to decide number of RADL pictures
-      preceding the IDR picture.
-  * Encoder enhancements:
-    - Moved from YASM to NASM assembler. Supports NASM assembler
-      version 2.13 and greater.
-    - Enable analysis save and load in a single run. Introduces two
-      new cli options `--analysis-save <filename>` and
-      `--analysis-load <filename>`.
-    - Comply to HDR10+ LLC specification.
-    - Reduced x265 build time by more than 50% by re-factoring
-      ipfilter.asm.
-  * Bug fixes:
-    - Fixed inconsistent output issue in deblock filter and
-      --const-vbv.
-    - Fixed Mac OS build warnings.
-    - Fixed inconsistency in pass-2 when weightp and cutree are
-      enabled.
-    - Fixed deadlock issue due to dropping of BREF frames, while
-      forcing slice types through qp file.
-- Bump soname to 151, also in baselibs.conf following upstream
-  changes.
-
--------------------------------------------------------------------
 Fri Dec 01 16:40:13 UTC 2017 - joerg.lorenzen@ki.tng.de
 
 - Update to version 2.6
x265.spec
Changed
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 
 Name: x265
-%define soname 151
+%define soname 146
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version: 2.7
+Version: 2.6
 Release: 0
 License: GPL-2.0+
 Summary: A free h265/HEVC encoder - encoder binary
@@ -49,7 +49,7 @@
 streams.
 
 %prep
-%setup -q -n %{name}_%{version}
+%setup -q -n %{name}_v%{version}
 %patch0 -p1
 %patch1 -p1
baselibs.conf
Changed
@@ -1,1 +1,1 @@
-libx265-151
+libx265-130
x265_2.7.tar.gz/.hg_archival.txt -> x265_2.6.tar.gz/.hg_archival.txt
Changed
@@ -1,4 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: e41a9bf2bac4a7af2bec2bbadf91e63752d320ef
+node: 0e9ea76945c89962cd46cee6537586e2054b2935
 branch: stable
-tag: 2.7
+tag: 2.6
x265_2.7.tar.gz/.hgtags -> x265_2.6.tar.gz/.hgtags
Changed
@@ -24,4 +24,3 @@
 3037c1448549ca920967831482c653e5892fa8ed 2.3
 e7a4dd48293b7956d4a20df257d23904cc78e376 2.4
 64b2d0bf45a52511e57a6b7299160b961ca3d51c 2.5
-0e9ea76945c89962cd46cee6537586e2054b2935 2.6
x265_2.7.tar.gz/build/README.txt -> x265_2.6.tar.gz/build/README.txt
Changed
@@ -9,8 +9,7 @@
 
 = Optional Prerequisites =
 
-1. To compile assembly primitives (performance)
-   a) If you are using release 2.6 or older, download and install Yasm 1.2.0 or later,
+1. Yasm 1.2.0 or later, to compile assembly primitives (performance)
 
    For Windows, download the latest yasm executable
    http://yasm.tortall.net/Download.html and copy the EXE into
@@ -34,24 +33,6 @@
    If cpu capabilities line says 'none!', then the encoder was built
    without yasm.
 
-   b) If you are building from the default branch after release 2.6, download and install nasm 2.13 or newer
-
-   For windows and linux, you can download the nasm installer from http://www.nasm.us/pub/nasm/releasebuilds/?C=M;O=D.
-   Make sure that it is in your PATH environment variable (%PATH% in windows, and $PATH in linux) so that cmake
-   can find it.
-
-   Once NASM is properly installed, run cmake to regenerate projects. If you
-   do not see the below line in the cmake output, NASM is not in the PATH.
-
-   -- Found Nasm 2.13 to build assembly primitives
-
-   Now build the encoder and run x265 -V:
-
-   x265 [info]: using cpu capabilities: MMX, SSE2, ...
-
-   If cpu capabilities line says 'none!', then the encoder was built
-   without nasm and will be considerably slower for performance.
-
 2. VisualLeakDetector (Windows Only)
 
    Download from https://vld.codeplex.com/releases and install. May need
x265_2.7.tar.gz/doc/reST/api.rst -> x265_2.6.tar.gz/doc/reST/api.rst
Changed
@@ -206,7 +206,7 @@
     /* x265_get_ref_frame_list:
      * returns negative on error, 0 when access unit were output.
      * This API must be called after(poc >= lookaheadDepth + bframes + 2) condition check */
-    int x265_get_ref_frame_list(x265_encoder *encoder, x265_picyuv**, x265_picyuv**, int, int, int*, int*);
+    int x265_get_ref_frame_list(x265_encoder *encoder, x265_picyuv**, x265_picyuv**, int, int);
 
 **x265_encoder_ctu_info** may be used to provide additional CTU-specific information to the encoder::
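The comment in the api.rst hunk says x265_get_ref_frame_list must only be called once poc >= lookaheadDepth + bframes + 2. A minimal sketch of that precondition as a standalone check might look like the following; the helper name refFrameListReady is hypothetical and is not part of the x265 API.

```cpp
// Hypothetical helper, not part of libx265: encodes the documented
// precondition poc >= lookaheadDepth + bframes + 2.
bool refFrameListReady(int poc, int lookaheadDepth, int bframes)
{
    return poc >= lookaheadDepth + bframes + 2;
}
```

With the defaults visible in the param.cpp hunk (lookaheadDepth = 20, bframes = 4), the first POC that satisfies the condition would be 26.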
x265_2.7.tar.gz/doc/reST/cli.rst -> x265_2.6.tar.gz/doc/reST/cli.rst
Changed
@@ -863,22 +863,21 @@
     sequence multiple times (presumably at varying bitrates). The encoder
     will not reuse analysis if slice type parameters do not match.
 
-.. option:: --analysis-save <filename>
+.. option:: --analysis-reuse-mode <string|int>
 
-    Encoder outputs analysis information of each frame. Analysis data from save mode is
-    written to the file specified. Requires cutree, pmode to be off. Default disabled.
-
-.. option:: --analysis-load <filename>
-
-    Encoder reuses analysis information from the file specified. By reading the analysis data writen by
-    an earlier encode of the same sequence, substantial redundant work may be avoided. Requires cutree, pmode
-    to be off. Default disabled.
+    This option allows reuse of analysis information from first pass to second pass.
+    :option:`--analysis-reuse-mode save` specifies that encoder outputs analysis information of each frame.
+    :option:`--analysis-reuse-mode load` specifies that encoder reuses analysis information from first pass.
+    There is no benefit using load mode without running encoder in save mode. Analysis data from save mode is
+    written to a file specified by :option:`--analysis-reuse-file`. The amount of analysis data stored/reused
+    is determined by :option:`--analysis-reuse-level`. By reading the analysis data writen by an earlier encode
+    of the same sequence, substantial redundant work may be avoided. Requires cutree, pmode to be off. Default 0.
 
-    The amount of analysis data stored/reused is determined by :option:`--analysis-reuse-level`.
+    **Values:** off(0), save(1): dump analysis data, load(2): read analysis data
 
 .. option:: --analysis-reuse-file <filename>
 
-    Specify a filename for `multi-pass-opt-analysis` and `multi-pass-opt-distortion`.
+    Specify a filename for analysis data (see :option:`--analysis-reuse-mode`)
     If no filename is specified, x265_analysis.dat is used.
 
 .. option:: --analysis-reuse-level <1..10>
@@ -1029,13 +1028,7 @@
     Level 4 - uses the depth of the neighbouring/ co-located CUs TU depth
     to limit the 1st subTU depth. The 1st subTU depth is taken as the
     limiting depth for the other subTUs.
-
-    Enabling levels 3 or 4 may cause a mismatch in the output bitstreams
-    between option:`--analysis-save` and option:`--analysis-load`
-    as all neighbouring CUs TU depth may not be available in the
-    option:`--analysis-load` run as only the best mode's information is
-    available to it.
-
+
     Default: 0
 
 .. option:: --nr-intra <integer>, --nr-inter <integer>
@@ -1351,14 +1344,7 @@
     This value represents the percentage difference between the inter cost and
     intra cost of a frame used in scenecut detection. For example, a value of 5 indicates,
     if the inter cost of a frame is greater than or equal to 95 percent of the intra cost of the frame,
-    then detect this frame as scenecut. Values between 5 and 15 are recommended. Default 5.
-
-.. option:: --radl <integer>
-
-    Number of RADL pictures allowed infront of IDR. Requires fixed keyframe interval.
-    Recommended value is 2-3. Default 0 (disabled).
-
-    **Range of values: Between 0 and `--bframes`
+    then detect this frame as scenecut. Values between 5 and 15 are recommended. Default 5.
 
 .. option:: --ctu-info <0, 1, 2, 4, 6>
 
@@ -1387,16 +1373,6 @@
     Default 20
 
     **Range of values:** Between the maximum consecutive bframe count (:option:`--bframes`) and 250
-.. option:: --gop-lookahead <integer>
-
-    Number of frames for GOP boundary decision lookahead. If a scenecut frame is found
-    within this from the gop boundary set by `--keyint`, the GOP will be extented until such a point,
-    otherwise the GOP will be terminated as set by `--keyint`. Default 0.
-
-    **Range of values:** Between 0 and (`--rc-lookahead` - mini-GOP length)
-
-    It is recommended to have `--gop-lookahaed` less than `--min-keyint` as scenecuts beyond
-    `--min-keyint` are already being coded as keyframes.
 
 .. option:: --lookahead-slices <0..16>
 
@@ -2064,7 +2040,7 @@
     Example for MaxCLL=1000 candela per square meter, MaxFALL=400
     candela per square meter:
 
-    --max-cll "1000,400"
+    --max-cll 1000,400
 
     Note that this string value will need to be escaped or quoted to
     protect against shell expansion on many platforms. No default.
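The scenecut-bias text kept as context in the cli.rst hunk describes a simple threshold: with a value of 5, a frame counts as a scenecut when its inter cost is at least 95 percent of its intra cost. The sketch below only restates that sentence as code. The function name and the use of doubles are assumptions for illustration, and the encoder's real lookahead logic involves more than this single comparison.

```cpp
// Illustrative only: the scenecut-bias rule as worded in cli.rst.
// biasPercent = 5 means "inter cost >= 95% of intra cost".
bool exceedsScenecutBias(double interCost, double intraCost, double biasPercent)
{
    const double threshold = (1.0 - biasPercent / 100.0) * intraCost;
    return interCost >= threshold;
}
```

A larger bias lowers the threshold, so more frames qualify; that matches the documented recommendation of values between 5 and 15.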
x265_2.7.tar.gz/doc/reST/releasenotes.rst -> x265_2.6.tar.gz/doc/reST/releasenotes.rst
Changed
@@ -2,32 +2,6 @@
 Release Notes
 *************
 
-Version 2.7
-===========
-
-Release date - 21st Feb, 2018.
-
-New features
-------------
-1. :option:`--gop-lookahead` can be used to extend the gop boundary(set by `--keyint`). The GOP will be extended, if a scene-cut frame is found within this many number of frames.
-2. Support for RADL pictures added in x265.
-    :option:`--radl` can be used to decide number of RADL pictures preceding the IDR picture.
-
-Encoder enhancements
---------------------
-1. Moved from YASM to NASM assembler. Supports NASM assembler version 2.13 and greater.
-2. Enable analysis save and load in a single run. Introduces two new cli options `--analysis-save <filename>` and `--analysis-load <filename>`.
-3. Comply to HDR10+ LLC specification.
-4. Reduced x265 build time by more than 50% by re-factoring ipfilter.asm.
-
-Bug fixes
----------
-1. Fixed inconsistent output issue in deblock filter and --const-vbv.
-2. Fixed Mac OS build warnings.
-3. Fixed inconsistency in pass-2 when weightp and cutree are enabled.
-4. Fixed deadlock issue due to dropping of BREF frames, while forcing slice types through qp file.
-
-
 Version 2.6
 ===========
x265_2.7.tar.gz/source/CMakeLists.txt -> x265_2.6.tar.gz/source/CMakeLists.txt
Changed
@@ -29,7 +29,7 @@
 option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF)
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 151)
+set(X265_BUILD 146)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -323,15 +323,15 @@
     execute_process(COMMAND ${CMAKE_CXX_COMPILER} -dumpversion OUTPUT_VARIABLE CC_VERSION)
 endif(GCC)
 
-find_package(Nasm)
+find_package(Yasm)
 if(ARM OR CROSS_COMPILE_ARM)
     option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" ON)
-elseif(NASM_FOUND AND X86)
-    if (NASM_VERSION_STRING VERSION_LESS "2.13.0")
-        message(STATUS "Nasm version ${NASM_VERSION_STRING} is too old. 2.13.0 or later required")
+elseif(YASM_FOUND AND X86)
+    if (YASM_VERSION_STRING VERSION_LESS "1.2.0")
+        message(STATUS "Yasm version ${YASM_VERSION_STRING} is too old. 1.2.0 or later required")
         option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" OFF)
     else()
-        message(STATUS "Found Nasm ${NASM_VERSION_STRING} to build assembly primitives")
+        message(STATUS "Found Yasm ${YASM_VERSION_STRING} to build assembly primitives")
         option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" ON)
     endif()
 else()
@@ -517,18 +517,18 @@
             list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
             add_custom_command(
                 OUTPUT ${ASM}.${SUFFIX}
-                COMMAND ${NASM_EXECUTABLE} ARGS ${NASM_FLAGS} ${ASM_SRC} -o ${ASM}.${SUFFIX}
+                COMMAND ${YASM_EXECUTABLE} ARGS ${YASM_FLAGS} ${ASM_SRC} -o ${ASM}.${SUFFIX}
                 DEPENDS ${ASM_SRC})
         endforeach()
     endif()
 endif()
 source_group(ASM FILES ${ASM_SRCS})
 if(ENABLE_HDR10_PLUS)
-    add_library(x265-static STATIC $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> $<TARGET_OBJECTS:dynamicHDR10> ${ASM_OBJS})
+    add_library(x265-static STATIC $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> $<TARGET_OBJECTS:dynamicHDR10> ${ASM_OBJS} ${ASM_SRCS})
     add_library(hdr10plus-static STATIC $<TARGET_OBJECTS:dynamicHDR10>)
     set_target_properties(hdr10plus-static PROPERTIES OUTPUT_NAME hdr10plus)
 else()
-    add_library(x265-static STATIC $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${ASM_OBJS})
+    add_library(x265-static STATIC $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${ASM_OBJS} ${ASM_SRCS})
 endif()
 if(NOT MSVC)
     set_target_properties(x265-static PROPERTIES OUTPUT_NAME x265)
@@ -546,19 +546,14 @@
             ARCHIVE DESTINATION ${LIB_INSTALL_DIR})
 endif()
 install(FILES x265.h "${PROJECT_BINARY_DIR}/x265_config.h" DESTINATION include)
+
 if(WIN32)
-    if(MSVC_IDE)
-        install(FILES "${PROJECT_BINARY_DIR}/Debug/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS Debug)
-        install(FILES "${PROJECT_BINARY_DIR}/RelWithDebInfo/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS RelWithDebInfo)
-        install(FILES "${PROJECT_BINARY_DIR}/Debug/libx265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS Debug OPTIONAL NAMELINK_ONLY)
-        install(FILES "${PROJECT_BINARY_DIR}/RelWithDebInfo/libx265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS RelWithDebInfo OPTIONAL NAMELINK_ONLY)
-    else()
-        install(FILES "${PROJECT_BINARY_DIR}/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS Debug)
-        install(FILES "${PROJECT_BINARY_DIR}/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS RelWithDebInfo)
-        install(FILES "${PROJECT_BINARY_DIR}/libx265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS Debug OPTIONAL NAMELINK_ONLY)
-        install(FILES "${PROJECT_BINARY_DIR}/libx265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS RelWithDebInfo OPTIONAL NAMELINK_ONLY)
-    endif()
+    install(FILES "${PROJECT_BINARY_DIR}/Debug/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS Debug)
+    install(FILES "${PROJECT_BINARY_DIR}/RelWithDebInfo/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS RelWithDebInfo)
+    install(FILES "${PROJECT_BINARY_DIR}/Debug/libx265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS Debug OPTIONAL NAMELINK_ONLY)
+    install(FILES "${PROJECT_BINARY_DIR}/RelWithDebInfo/libx265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS RelWithDebInfo OPTIONAL NAMELINK_ONLY)
 endif()
+
 if(CMAKE_RC_COMPILER)
     # The resource compiler does not need CFLAGS or macro defines. It
     # often breaks them
@@ -647,9 +642,7 @@
         endforeach()
         if(PLIBLIST)
             # blacklist of libraries that should not be in Libs.private
-            list(REMOVE_ITEM PLIBLIST "-lc" "-lpthread" "-lmingwex" "-lmingwthrd"
-                "-lmingw32" "-lmoldname" "-lmsvcrt" "-ladvapi32" "-lshell32"
-                "-luser32" "-lkernel32")
+            list(REMOVE_ITEM PLIBLIST "-lc" "-lpthread")
             string(REPLACE ";" " " PRIVATE_LIBS "${PLIBLIST}")
         else()
             set(PRIVATE_LIBS "")
@@ -693,11 +686,11 @@
         if(ENABLE_HDR10_PLUS)
             add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${GETOPT}
                            x265.cpp x265.h x265cli.h
-                           $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> $<TARGET_OBJECTS:dynamicHDR10> ${ASM_OBJS})
+                           $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> $<TARGET_OBJECTS:dynamicHDR10> ${ASM_OBJS} ${ASM_SRCS})
         else()
             add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${GETOPT}
                            x265.cpp x265.h x265cli.h
-                           $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${ASM_OBJS})
+                           $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${ASM_OBJS} ${ASM_SRCS})
         endif()
     else()
         add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${GETOPT} ${X265_RC_FILE}
x265_2.6.tar.gz/source/cmake/CMakeASM_YASMInformation.cmake
Added
@@ -0,0 +1,68 @@
+set(ASM_DIALECT "_YASM")
+set(CMAKE_ASM${ASM_DIALECT}_SOURCE_FILE_EXTENSIONS asm)
+
+if(X64)
+    list(APPEND ASM_FLAGS -DARCH_X86_64=1)
+    if(ENABLE_PIC)
+        list(APPEND ASM_FLAGS -DPIC)
+    endif()
+    if(APPLE)
+        set(ARGS -f macho64 -m amd64 -DPREFIX)
+    elseif(UNIX AND NOT CYGWIN)
+        set(ARGS -f elf64 -m amd64)
+    else()
+        set(ARGS -f win64 -m amd64)
+    endif()
+else()
+    list(APPEND ASM_FLAGS -DARCH_X86_64=0)
+    if(APPLE)
+        set(ARGS -f macho -DPREFIX)
+    elseif(UNIX AND NOT CYGWIN)
+        set(ARGS -f elf32)
+    else()
+        set(ARGS -f win32 -DPREFIX)
+    endif()
+endif()
+
+if(GCC)
+    list(APPEND ASM_FLAGS -DHAVE_ALIGNED_STACK=1)
+else()
+    list(APPEND ASM_FLAGS -DHAVE_ALIGNED_STACK=0)
+endif()
+
+if(HIGH_BIT_DEPTH)
+    if(MAIN12)
+        list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=12 -DX265_NS=${X265_NS})
+    else()
+        list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=10 -DX265_NS=${X265_NS})
+    endif()
+else()
+    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=0 -DBIT_DEPTH=8 -DX265_NS=${X265_NS})
+endif()
+
+list(APPEND ASM_FLAGS "${CMAKE_ASM_YASM_FLAGS}")
+
+if(CMAKE_BUILD_TYPE MATCHES Release)
+    list(APPEND ASM_FLAGS "${CMAKE_ASM_YASM_FLAGS_RELEASE}")
+elseif(CMAKE_BUILD_TYPE MATCHES Debug)
+    list(APPEND ASM_FLAGS "${CMAKE_ASM_YASM_FLAGS_DEBUG}")
+elseif(CMAKE_BUILD_TYPE MATCHES MinSizeRel)
+    list(APPEND ASM_FLAGS "${CMAKE_ASM_YASM_FLAGS_MINSIZEREL}")
+elseif(CMAKE_BUILD_TYPE MATCHES RelWithDebInfo)
+    list(APPEND ASM_FLAGS "${CMAKE_ASM_YASM_FLAGS_RELWITHDEBINFO}")
+endif()
+
+set(YASM_FLAGS ${ARGS} ${ASM_FLAGS} PARENT_SCOPE)
+string(REPLACE ";" " " CMAKE_ASM_YASM_COMPILER_ARG1 "${ARGS}")
+
+# This section exists to override the one in CMakeASMInformation.cmake
+# (the default Information file). This removes the <FLAGS>
+# thing so that your C compiler flags that have been set via
+# set_target_properties don't get passed to yasm and confuse it.
+if(NOT CMAKE_ASM${ASM_DIALECT}_COMPILE_OBJECT)
+    string(REPLACE ";" " " STR_ASM_FLAGS "${ASM_FLAGS}")
+    set(CMAKE_ASM${ASM_DIALECT}_COMPILE_OBJECT "<CMAKE_ASM${ASM_DIALECT}_COMPILER> ${STR_ASM_FLAGS} -o <OBJECT> <SOURCE>")
+endif()
+
+include(CMakeASMInformation)
+set(ASM_DIALECT)
x265_2.6.tar.gz/source/cmake/CMakeDetermineASM_YASMCompiler.cmake
Added
@@ -0,0 +1,5 @@
+set(ASM_DIALECT "_YASM")
+set(CMAKE_ASM${ASM_DIALECT}_COMPILER ${YASM_EXECUTABLE})
+set(CMAKE_ASM${ASM_DIALECT}_COMPILER_INIT ${_CMAKE_TOOLCHAIN_PREFIX}yasm)
+include(CMakeDetermineASMCompiler)
+set(ASM_DIALECT)
x265_2.6.tar.gz/source/cmake/CMakeTestASM_YASMCompiler.cmake
Added
@@ -0,0 +1,3 @@
+set(ASM_DIALECT "_YASM")
+include(CMakeTestASMCompiler)
+set(ASM_DIALECT)
x265_2.6.tar.gz/source/cmake/FindYasm.cmake
Added
@@ -0,0 +1,25 @@
+include(FindPackageHandleStandardArgs)
+
+# Simple path search with YASM_ROOT environment variable override
+find_program(YASM_EXECUTABLE
+    NAMES yasm yasm-1.2.0-win32 yasm-1.2.0-win64 yasm yasm-1.3.0-win32 yasm-1.3.0-win64
+    HINTS $ENV{YASM_ROOT} ${YASM_ROOT}
+    PATH_SUFFIXES bin
+)
+
+if(YASM_EXECUTABLE)
+    execute_process(COMMAND ${YASM_EXECUTABLE} --version
+        OUTPUT_VARIABLE yasm_version
+        ERROR_QUIET
+        OUTPUT_STRIP_TRAILING_WHITESPACE
+        )
+    if(yasm_version MATCHES "^yasm ([0-9\\.]*)")
+        set(YASM_VERSION_STRING "${CMAKE_MATCH_1}")
+    endif()
+    unset(yasm_version)
+endif()
+
+# Provide standardized success/failure messages
+find_package_handle_standard_args(yasm
+    REQUIRED_VARS YASM_EXECUTABLE
+    VERSION_VAR YASM_VERSION_STRING)
x265_2.7.tar.gz/source/cmake/version.cmake -> x265_2.6.tar.gz/source/cmake/version.cmake
Changed
@@ -22,11 +22,12 @@
             set(hg_${key} ${value})
         endforeach()
         if(DEFINED hg_tag)
+            set(X265_VERSION ${hg_tag})
             set(X265_LATEST_TAG ${hg_tag})
+            set(X265_TAG_DISTANCE "0")
         elseif(DEFINED hg_node)
-            set(X265_LATEST_TAG ${hg_latesttag})
-            set(X265_TAG_DISTANCE ${hg_latesttagdistance})
-            string(SUBSTRING "${hg_node}" 0 12 X265_REVISION_ID)
+            string(SUBSTRING "${hg_node}" 0 16 hg_id)
+            set(X265_VERSION "${hg_latesttag}+${hg_latesttagdistance}-${hg_id}")
         endif()
     elseif(HG_EXECUTABLE AND EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../.hg)
         if(EXISTS "${HG_EXECUTABLE}.bat")
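On the added (`+`) side of the version.cmake hunk, an untagged archive gets a version string of the form latest tag, `+`, tag distance, `-`, and the first 16 characters of the node hash. A rough C++ rendering of that composition is shown below to make the format concrete. The function name composeVersion is hypothetical, and the tag distance used in the usage note is made up for illustration.

```cpp
#include <string>

// Hypothetical C++ rendering of the "+" side of the hunk:
// "<latest tag>+<tag distance>-<first 16 characters of the node hash>".
std::string composeVersion(const std::string& latestTag, int tagDistance, const std::string& node)
{
    return latestTag + "+" + std::to_string(tagDistance) + "-" + node.substr(0, 16);
}
```

For example, composeVersion("2.6", 3, "0e9ea76945c89962cd46cee6537586e2054b2935") yields "2.6+3-0e9ea76945c89962". The removed (`-`) side instead keeps the tag, the distance and a 12-character revision id in three separate variables.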
x265_2.7.tar.gz/source/common/CMakeLists.txt -> x265_2.6.tar.gz/source/common/CMakeLists.txt
Changed
@@ -56,26 +56,28 @@
     endif()
     set(VEC_PRIMITIVES vec/vec-primitives.cpp ${PRIMITIVES})
     source_group(Intrinsics FILES ${VEC_PRIMITIVES})
+
     set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h dct8.h loopfilter.h seaintegral.h)
     set(A_SRCS pixel-a.asm const-a.asm cpu-a.asm ssd-a.asm mc-a.asm
                mc-a2.asm pixel-util8.asm blockcopy8.asm
                pixeladd8.asm dct8.asm seaintegral.asm)
     if(HIGH_BIT_DEPTH)
-        set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm v4-ipfilter16.asm h4-ipfilter16.asm h-ipfilter16.asm ipfilter16.asm loopfilter.asm)
+        set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm loopfilter.asm)
     else()
-        set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm intrapred8_allangs.asm v4-ipfilter8.asm h-ipfilter8.asm ipfilter8.asm loopfilter.asm)
+        set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm intrapred8_allangs.asm ipfilter8.asm loopfilter.asm)
     endif()
+
     if(NOT X64)
         set(A_SRCS ${A_SRCS} pixel-32.asm)
     endif()
 
     if(MSVC_IDE OR XCODE)
-        # MSVC requires custom build rules in the main cmake script for nasm
-        set(MSVC_ASMS "${A_SRCS}" CACHE INTERNAL "nasm sources")
+        # MSVC requires custom build rules in the main cmake script for yasm
+        set(MSVC_ASMS "${A_SRCS}" CACHE INTERNAL "yasm sources")
         set(A_SRCS)
     endif()
 
-    enable_language(ASM_NASM)
+    enable_language(ASM_YASM)
 
     foreach(SRC ${A_SRCS} ${C_SRCS})
         set(ASM_PRIMITIVES ${ASM_PRIMITIVES} x86/${SRC})
x265_2.7.tar.gz/source/common/common.h -> x265_2.6.tar.gz/source/common/common.h
Changed
@@ -75,10 +75,11 @@
 #define ALIGN_VAR_8(T, var) T var __attribute__((aligned(8)))
 #define ALIGN_VAR_16(T, var) T var __attribute__((aligned(16)))
 #define ALIGN_VAR_32(T, var) T var __attribute__((aligned(32)))
+
 #if defined(__MINGW32__)
 #define fseeko fseeko64
-#define ftello ftello64
 #endif
+
 #elif defined(_MSC_VER)
 
 #define ALIGN_VAR_4(T, var) __declspec(align(4)) T var
@@ -86,8 +87,9 @@
 #define ALIGN_VAR_16(T, var) __declspec(align(16)) T var
 #define ALIGN_VAR_32(T, var) __declspec(align(32)) T var
 #define fseeko _fseeki64
-#define ftello _ftelli64
+
 #endif // if defined(__GNUC__)
+
 #if HAVE_INT_TYPES_H
 #define __STDC_FORMAT_MACROS
 #include <inttypes.h>
x265_2.7.tar.gz/source/common/cudata.cpp -> x265_2.6.tar.gz/source/common/cudata.cpp
Changed
@@ -1626,7 +1626,7 @@
                 dir |= (1 << list);
                 candMvField[count][list].mv = colmv;
                 candMvField[count][list].refIdx = refIdx;
-                if (m_encData->m_param->scaleFactor && m_encData->m_param->analysisSave && m_log2CUSize[0] < 4)
+                if (m_encData->m_param->scaleFactor && m_encData->m_param->analysisReuseMode == X265_ANALYSIS_SAVE && m_log2CUSize[0] < 4)
                 {
                     MV dist(MAX_MV, MAX_MV);
                     candMvField[count][list].mv = dist;
@@ -1791,7 +1791,7 @@
     int curRefPOC = m_slice->m_refPOCList[picList][refIdx];
     int curPOC = m_slice->m_poc;
 
-    if (m_encData->m_param->scaleFactor && m_encData->m_param->analysisSave && (m_log2CUSize[0] < 4))
+    if (m_encData->m_param->scaleFactor && m_encData->m_param->analysisReuseMode == X265_ANALYSIS_SAVE && (m_log2CUSize[0] < 4))
     {
         MV dist(MAX_MV, MAX_MV);
         pmv[numMvc++] = amvpCand[num++] = dist;
x265_2.7.tar.gz/source/common/deblock.cpp -> x265_2.6.tar.gz/source/common/deblock.cpp
Changed
@@ -207,18 +207,21 @@
     static const MV zeroMv(0, 0);
     const Slice* const sliceQ = cuQ->m_slice;
     const Slice* const sliceP = cuP->m_slice;
-    const Frame* refP0 = (cuP->m_refIdx[0][partP] >= 0) ? sliceP->m_refFrameList[0][cuP->m_refIdx[0][partP]] : NULL;
-    const Frame* refQ0 = (cuQ->m_refIdx[0][partQ] >= 0) ? sliceQ->m_refFrameList[0][cuQ->m_refIdx[0][partQ]] : NULL;
+
+    const Frame* refP0 = sliceP->m_refFrameList[0][cuP->m_refIdx[0][partP]];
+    const Frame* refQ0 = sliceQ->m_refFrameList[0][cuQ->m_refIdx[0][partQ]];
     const MV& mvP0 = refP0 ? cuP->m_mv[0][partP] : zeroMv;
     const MV& mvQ0 = refQ0 ? cuQ->m_mv[0][partQ] : zeroMv;
+
     if (sliceQ->isInterP() && sliceP->isInterP())
     {
         return ((refP0 != refQ0) ||
                 (abs(mvQ0.x - mvP0.x) >= 4) || (abs(mvQ0.y - mvP0.y) >= 4)) ? 1 : 0;
     }
+
     // (sliceQ->isInterB() || sliceP->isInterB())
-    const Frame* refP1 = (cuP->m_refIdx[1][partP] >= 0) ? sliceP->m_refFrameList[1][cuP->m_refIdx[1][partP]] : NULL;
-    const Frame* refQ1 = (cuQ->m_refIdx[1][partQ] >= 0) ? sliceQ->m_refFrameList[1][cuQ->m_refIdx[1][partQ]] : NULL;
+    const Frame* refP1 = sliceP->m_refFrameList[1][cuP->m_refIdx[1][partP]];
+    const Frame* refQ1 = sliceQ->m_refFrameList[1][cuQ->m_refIdx[1][partQ]];
     const MV& mvP1 = refP1 ? cuP->m_mv[1][partP] : zeroMv;
     const MV& mvQ1 = refQ1 ? cuQ->m_mv[1][partQ] : zeroMv;
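The removed (`-`) lines of the deblock.cpp hunk test the reference index for `>= 0` before using it to index m_refFrameList, while the added (`+`) lines index unconditionally. The guarded form can be sketched in isolation as follows. The Frame stand-in, the std::vector container and the extra upper-bound check are assumptions made for this example and do not come from x265.

```cpp
#include <cstddef>
#include <vector>

struct Frame { int poc; };  // stand-in for x265's Frame, for illustration only

// Returns the referenced frame, or nullptr when refIdx is negative
// (no reference in that list) or past the end of the list.
// The upper-bound check is an addition for this sketch.
const Frame* lookupRef(const std::vector<const Frame*>& refList, int refIdx)
{
    if (refIdx < 0 || static_cast<std::size_t>(refIdx) >= refList.size())
        return nullptr;
    return refList[static_cast<std::size_t>(refIdx)];
}
```

Returning a null pointer for a negative index lets the later `refP0 ? ... : zeroMv` selections in the hunk fall back to the zero motion vector without reading outside the list.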
x265_2.7.tar.gz/source/common/frame.h -> x265_2.6.tar.gz/source/common/frame.h
Changed
@@ -98,6 +98,7 @@
 
     float* m_quantOffsets; // points to quantOffsets in x265_picture
     x265_sei m_userSEI;
+    Event m_reconEncoded;
 
     /* Frame Parallelism - notification between FrameEncoders of available motion reference rows */
     ThreadSafeInteger* m_reconRowFlag; // flag of CTU rows completely reconstructed and extended for motion reference
x265_2.7.tar.gz/source/common/framedata.cpp -> x265_2.6.tar.gz/source/common/framedata.cpp
Changed
@@ -40,12 +40,11 @@
     m_spsrpsIdx = -1;
     if (param.rc.bStatWrite)
         m_spsrps = const_cast<RPS*>(sps.spsrps);
-    bool isallocated = m_cuMemPool.create(0, param.internalCsp, sps.numCUsInFrame, param);
-    if (isallocated)
-        for (uint32_t ctuAddr = 0; ctuAddr < sps.numCUsInFrame; ctuAddr++)
-            m_picCTU[ctuAddr].initialize(m_cuMemPool, 0, param, ctuAddr);
-    else
-        return false;
+
+    m_cuMemPool.create(0, param.internalCsp, sps.numCUsInFrame, param);
+    for (uint32_t ctuAddr = 0; ctuAddr < sps.numCUsInFrame; ctuAddr++)
+        m_picCTU[ctuAddr].initialize(m_cuMemPool, 0, param, ctuAddr);
+
     CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame);
     CHECKED_MALLOC(m_rowStat, RCStatRow, sps.numCuInHeight);
     reinit(sps);
@@ -77,12 +76,16 @@
     X265_FREE(m_cuStat);
     X265_FREE(m_rowStat);
-    for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
+
+    if (m_meBuffer)
     {
-        if (m_meBuffer[i] != NULL)
+        for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
         {
-            X265_FREE(m_meBuffer[i]);
-            m_meBuffer[i] = NULL;
+            if (m_meBuffer[i] != NULL)
+            {
+                X265_FREE(m_meBuffer[i]);
+                m_meBuffer[i] = NULL;
+            }
         }
     }
 }
x265_2.7.tar.gz/source/common/lowres.cpp -> x265_2.6.tar.gz/source/common/lowres.cpp
Changed
@@ -89,7 +89,7 @@
         }
     }
 
-    for (int i = 0; i < bframes + 2; i++)
+    for (int i = 0; i < bframes + 1; i++)
     {
         CHECKED_MALLOC(lowresMvs[0][i], MV, cuCount);
         CHECKED_MALLOC(lowresMvs[1][i], MV, cuCount);
@@ -118,7 +118,7 @@
         }
     }
 
-    for (int i = 0; i < bframes + 2; i++)
+    for (int i = 0; i < bframes + 1; i++)
     {
         X265_FREE(lowresMvs[0][i]);
         X265_FREE(lowresMvs[1][i]);
@@ -152,7 +152,7 @@
         for (int x = 0; x < bframes + 2; x++)
             rowSatds[y][x][0] = -1;
 
-    for (int i = 0; i < bframes + 2; i++)
+    for (int i = 0; i < bframes + 1; i++)
     {
         lowresMvs[0][i][0].x = 0x7FFF;
         lowresMvs[1][i][0].x = 0x7FFF;
x265_2.7.tar.gz/source/common/lowres.h -> x265_2.6.tar.gz/source/common/lowres.h
Changed
@@ -129,9 +129,9 @@
     uint8_t* intraMode;
     int64_t satdCost;
     uint16_t* lowresCostForRc;
-    uint16_t* lowresCosts[X265_BFRAME_MAX + 2][X265_BFRAME_MAX + 2];
-    int32_t* lowresMvCosts[2][X265_BFRAME_MAX + 2];
-    MV* lowresMvs[2][X265_BFRAME_MAX + 2];
+    uint16_t(*lowresCosts[X265_BFRAME_MAX + 2][X265_BFRAME_MAX + 2]);
+    int32_t* lowresMvCosts[2][X265_BFRAME_MAX + 1];
+    MV* lowresMvs[2][X265_BFRAME_MAX + 1];
     uint32_t maxBlocksInRow;
     uint32_t maxBlocksInCol;
     uint32_t maxBlocksInRowFullRes;
x265_2.7.tar.gz/source/common/param.cpp -> x265_2.6.tar.gz/source/common/param.cpp
Changed
@@ -144,7 +144,6 @@
     /* Coding Structure */
     param->keyframeMin = 0;
     param->keyframeMax = 250;
-    param->gopLookahead = 0;
     param->bOpenGOP = 1;
     param->bframes = 4;
     param->lookaheadDepth = 20;
@@ -154,7 +153,6 @@
     param->lookaheadSlices = 8;
     param->lookaheadThreads = 0;
     param->scenecutBias = 5.0;
-    param->radl = 0;
     /* Intra Coding Tools */
     param->bEnableConstrainedIntra = 0;
     param->bEnableStrongIntraSmoothing = 1;
@@ -198,12 +196,10 @@
     param->rdPenalty = 0;
     param->psyRd = 2.0;
     param->psyRdoq = 0.0;
-    param->analysisReuseMode = 0; /*DEPRECATED*/
+    param->analysisReuseMode = 0;
     param->analysisMultiPassRefine = 0;
     param->analysisMultiPassDistortion = 0;
     param->analysisReuseFileName = NULL;
-    param->analysisSave = NULL;
-    param->analysisLoad = NULL;
     param->bIntraInBFrames = 0;
     param->bLossless = 0;
     param->bCULossless = 0;
@@ -853,7 +849,7 @@
             p->rc.bStrictCbr = atobool(value);
             p->rc.pbFactor = 1.0;
         }
-        OPT("analysis-reuse-mode") p->analysisReuseMode = parseName(value, x265_analysis_names, bError); /*DEPRECATED*/
+        OPT("analysis-reuse-mode") p->analysisReuseMode = parseName(value, x265_analysis_names, bError);
         OPT("sar")
         {
             p->vui.aspectRatioIdc = parseName(value, x265_sar_names, bError);
@@ -1008,10 +1004,6 @@
                 bError = true;
             }
         }
-        OPT("gop-lookahead") p->gopLookahead = atoi(value);
-        OPT("analysis-save") p->analysisSave = strdup(value);
-        OPT("analysis-load") p->analysisLoad = strdup(value);
-        OPT("radl") p->radl = atoi(value);
         else
             return X265_PARAM_BAD_NAME;
     }
@@ -1318,14 +1310,10 @@
           "scenecutThreshold must be greater than 0");
     CHECK(param->scenecutBias < 0 || 100 < param->scenecutBias,
           "scenecut-bias must be between 0 and 100");
-    CHECK(param->radl < 0 || param->radl > param->bframes,
-          "radl must be between 0 and bframes");
     CHECK(param->rdPenalty < 0 || param->rdPenalty > 2,
           "Valid penalty for 32x32 intra TU in non-I slices. 0:disabled 1:RD-penalty 2:maximum");
     CHECK(param->keyframeMax < -1,
           "Invalid max IDR period in frames. value should be greater than -1");
-    CHECK(param->gopLookahead < -1,
-          "GOP lookahead must be greater than -1");
     CHECK(param->decodedPictureHashSEI < 0 || param->decodedPictureHashSEI > 3,
           "Invalid hash option. Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum");
     CHECK(param->rc.vbvBufferSize < 0,
@@ -1352,7 +1340,9 @@
           "Constant QP is incompatible with 2pass");
     CHECK(param->rc.bStrictCbr && (param->rc.bitrate <= 0 || param->rc.vbvBufferSize <=0),
           "Strict-cbr cannot be applied without specifying target bitrate or vbv bufsize");
-    CHECK((param->analysisSave || param->analysisLoad) && (param->analysisReuseLevel < 1 || param->analysisReuseLevel > 10),
+    CHECK(param->analysisReuseMode && (param->analysisReuseMode < X265_ANALYSIS_OFF || param->analysisReuseMode > X265_ANALYSIS_LOAD),
+          "Invalid analysis mode. Analysis mode 0: OFF 1: SAVE : 2 LOAD");
+    CHECK(param->analysisReuseMode && (param->analysisReuseLevel < 1 || param->analysisReuseLevel > 10),
           "Invalid analysis refine level. Value must be between 1 and 10 (inclusive)");
     CHECK(param->scaleFactor > 2, "Invalid scale-factor. Supports factor <= 2");
     CHECK(param->rc.qpMax < QP_MIN || param->rc.qpMax > QP_MAX_MAX,
@@ -1530,15 +1520,11 @@
 char *x265_param2string(x265_param* p, int padx, int pady)
 {
     char *buf, *s;
-    size_t bufSize = 4000 + p->rc.zoneCount * 64;
-    if (p->numaPools)
-        bufSize += strlen(p->numaPools);
-    if (p->masteringDisplayColorVolume)
-        bufSize += strlen(p->masteringDisplayColorVolume);
 
-    buf = s = X265_MALLOC(char, bufSize);
+    buf = s = X265_MALLOC(char, MAXPARAMSIZE);
     if (!buf)
         return NULL;
+
 #define BOOL(param, cliopt) \
     s += sprintf(s, " %s", (param) ? cliopt : "no-" cliopt);
 
@@ -1553,7 +1539,7 @@
     BOOL(p->bEnableSsim, "ssim");
     s += sprintf(s, " log-level=%d", p->logLevel);
     if (p->csvfn)
-        s += sprintf(s, " csv csv-log-level=%d", p->csvLogLevel);
+        s += sprintf(s, " csvfn=%s csv-log-level=%d", p->csvfn, p->csvLogLevel);
     s += sprintf(s, " bitdepth=%d", p->internalBitDepth);
     s += sprintf(s, " input-csp=%d", p->internalCsp);
     s += sprintf(s, " fps=%u/%u", p->fpsNum, p->fpsDenom);
@@ -1575,7 +1561,6 @@
     BOOL(p->bOpenGOP, "open-gop");
     s += sprintf(s, " min-keyint=%d", p->keyframeMin);
     s += sprintf(s, " keyint=%d", p->keyframeMax);
-    s += sprintf(s, " gop-lookahead=%d", p->gopLookahead);
     s += sprintf(s, " bframes=%d", p->bframes);
     s += sprintf(s, " b-adapt=%d", p->bFrameAdaptive);
     BOOL(p->bBPyramid, "b-pyramid");
@@ -1583,7 +1568,6 @@
     s += sprintf(s, " rc-lookahead=%d", p->lookaheadDepth);
     s += sprintf(s, " lookahead-slices=%d", p->lookaheadSlices);
     s += sprintf(s, " scenecut=%d", p->scenecutThreshold);
-    s += sprintf(s, " radl=%d", p->radl);
     BOOL(p->bIntraRefresh, "intra-refresh");
     s += sprintf(s, " ctu=%d", p->maxCUSize);
     s += sprintf(s, " min-cu-size=%d", p->minCUSize);
@@ -1629,6 +1613,7 @@
     s += sprintf(s, " psy-rd=%.2f", p->psyRd);
     s += sprintf(s, " psy-rdoq=%.2f", p->psyRdoq);
     BOOL(p->bEnableRdRefine, "rd-refine");
+    s += sprintf(s, " analysis-reuse-mode=%d", p->analysisReuseMode);
     BOOL(p->bLossless, "lossless");
     s += sprintf(s, " cbqpoffs=%d", p->cbQpOffset);
     s += sprintf(s, " crqpoffs=%d", p->crQpOffset);
@@ -1726,10 +1711,6 @@
     BOOL(p->bEmitHDRSEI, "hdr");
     BOOL(p->bHDROpt, "hdr-opt");
     BOOL(p->bDhdr10opt, "dhdr10-opt");
-    if (p->analysisSave)
-        s += sprintf(s, " analysis-save");
-    if (p->analysisLoad)
-        s += sprintf(s, " analysis-load");
     s += sprintf(s, " analysis-reuse-level=%d", p->analysisReuseLevel);
     s += sprintf(s, " scale-factor=%d", p->scaleFactor);
     s += sprintf(s, " refine-intra=%d", p->intraRefine);
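In the x265_param2string hunk, the removed (`-`) lines size the output buffer from the parameters (4000 bytes, plus 64 per rate-control zone, plus the lengths of the two optional strings), whereas the added (`+`) lines allocate a fixed MAXPARAMSIZE. The sizing rule on the removed side can be sketched on its own as below. The function name paramStringCapacity and its flat argument list are hypothetical, standing in for the fields read from x265_param.

```cpp
#include <cstddef>
#include <cstring>

// Sketch of the sizing rule on the "-" side of the x265_param2string hunk:
// 4000 bytes, plus 64 per rate-control zone, plus the length of each
// optional string that is present. Names here are illustrative.
std::size_t paramStringCapacity(int zoneCount, const char* numaPools, const char* masteringDisplay)
{
    std::size_t size = 4000 + static_cast<std::size_t>(zoneCount) * 64;
    if (numaPools)
        size += std::strlen(numaPools);
    if (masteringDisplay)
        size += std::strlen(masteringDisplay);
    return size;
}
```

Because the function appends with sprintf and no bounds checks, the capacity has to cover the worst case up front; a fixed 2000-byte buffer gives less headroom once variable-length strings such as csvfn are printed into it.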
x265_2.7.tar.gz/source/common/param.h -> x265_2.6.tar.gz/source/common/param.h
Changed
@@ -53,5 +53,8 @@
 int x265_param_parse(x265_param *p, const char *name, const char *value);
 #define PARAM_NS X265_NS
 #endif
+
+#define MAXPARAMSIZE 2000
 }
+
 #endif // ifndef X265_PARAM_H
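Note on the buffer change in the param.cpp and param.h hunks: the 2.7 side sizes the output buffer of x265_param2string from the parameter contents (zone count plus the lengths of the two optional strings), while the 2.6 side allocates a fixed MAXPARAMSIZE (2000) bytes and appends with unbounded sprintf. The block below is a minimal sketch of both ideas and is not code from x265. `appendf` and `estimateSize` are hypothetical helper names, and the bounded append uses `vsnprintf`, a swapped-in technique that neither version of the source is shown to use.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not an x265 function): append formatted text at
// buf + used without writing past cap bytes. Returns false on truncation.
static bool appendf(char* buf, size_t cap, size_t& used, const char* fmt, ...)
{
    assert(buf != nullptr && cap > 0);
    if (used >= cap - 1)
        return false; // no room left besides the terminator

    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf + used, cap - used, fmt, ap);
    va_end(ap);

    if (n < 0)
        return false; // encoding error reported by vsnprintf
    if (static_cast<size_t>(n) >= cap - used)
    {
        used = cap - 1; // output was cut; buffer is full and NUL-terminated
        return false;
    }
    used += static_cast<size_t>(n);
    return true;
}

// Hypothetical size estimate that mirrors the arithmetic of the removed
// 2.7 lines: a fixed base, 64 bytes per zone, plus optional string lengths.
static size_t estimateSize(int zoneCount, const char* numaPools,
                           const char* masteringDisplay)
{
    size_t bufSize = 4000 + static_cast<size_t>(zoneCount) * 64;
    if (numaPools)
        bufSize += strlen(numaPools);
    if (masteringDisplay)
        bufSize += strlen(masteringDisplay);
    return bufSize;
}
```

With a fixed-size buffer, a bounded append of this kind turns a long csvfn value into a detectable truncation instead of a write past the end of the allocation.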
x265_2.7.tar.gz/source/common/picyuv.cpp -> x265_2.6.tar.gz/source/common/picyuv.cpp
Changed
@@ -358,20 +358,18 @@
     pixel *uPic = m_picOrg[1];
     pixel *vPic = m_picOrg[2];
 
-    if (param.csvLogLevel >= 2 || param.maxCLL || param.maxFALL)
+    for (int r = 0; r < height; r++)
     {
-        for (int r = 0; r < height; r++)
+        for (int c = 0; c < width; c++)
         {
-            for (int c = 0; c < width; c++)
-            {
-                m_maxLumaLevel = X265_MAX(yPic[c], m_maxLumaLevel);
-                m_minLumaLevel = X265_MIN(yPic[c], m_minLumaLevel);
-                lumaSum += yPic[c];
-            }
-            yPic += m_stride;
+            m_maxLumaLevel = X265_MAX(yPic[c], m_maxLumaLevel);
+            m_minLumaLevel = X265_MIN(yPic[c], m_minLumaLevel);
+            lumaSum += yPic[c];
         }
-        m_avgLumaLevel = (double)lumaSum / (m_picHeight * m_picWidth);
+        yPic += m_stride;
     }
+    m_avgLumaLevel = (double)lumaSum / (m_picHeight * m_picWidth);
+
     if (param.csvLogLevel >= 2)
     {
         if (param.internalCsp != X265_CSP_I400)
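Note on the loop in the picyuv.cpp hunk: both sides compute the minimum, maximum and average luma level of a plane by walking it row by row and advancing the row pointer by the stride. The 2.7 side only does so when csvLogLevel >= 2, maxCLL or maxFALL is set, while the 2.6 side always does. A standalone sketch of that computation follows. `LumaStats` and `lumaStats` are hypothetical names, the 16-bit sample type and the 0xFFFF starting minimum are assumptions, and the zero-size guard is an addition that the hunk does not contain.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical result type; x265 keeps these values in PicYuv members.
struct LumaStats
{
    uint16_t minLevel;
    uint16_t maxLevel;
    double   avgLevel;
};

// Scan a width x height luma plane whose rows are `stride` samples apart.
static LumaStats lumaStats(const uint16_t* yPic, int width, int height,
                           ptrdiff_t stride)
{
    LumaStats s = { 0xFFFF, 0, 0.0 }; // assumed initial min/max values
    uint64_t lumaSum = 0;

    for (int r = 0; r < height; r++)
    {
        for (int c = 0; c < width; c++)
        {
            s.maxLevel = std::max(yPic[c], s.maxLevel);
            s.minLevel = std::min(yPic[c], s.minLevel);
            lumaSum += yPic[c];
        }
        yPic += stride; // skip any padding between rows
    }

    // Guard added for the sketch: avoid dividing by zero on an empty plane.
    if (width > 0 && height > 0)
        s.avgLevel = (double)lumaSum / ((double)width * height);
    return s;
}
```

Because the inner loop reads only `width` samples per row, padding samples that sit between `width` and `stride` never enter the statistics.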
x265_2.7.tar.gz/source/common/x86/asm-primitives.cpp -> x265_2.6.tar.gz/source/common/x86/asm-primitives.cpp
Changed
@@ -116,6 +116,7 @@
 #include "dct8.h"
 #include "seaintegral.h"
 }
+
 #define ALL_LUMA_CU_TYPED(prim, fncdef, fname, cpu) \
     p.cu[BLOCK_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \
     p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \
x265_2.7.tar.gz/source/common/x86/blockcopy8.asm -> x265_2.6.tar.gz/source/common/x86/blockcopy8.asm
Changed
@@ -3850,7 +3850,7 @@
     mov r4d, %2/4
     add r1, r1
     add r3, r3
-.loop:
+.loop
     movu m0, [r2]
     movu m1, [r2 + 16]
     movu m2, [r2 + 32]
@@ -3905,7 +3905,7 @@
     lea r5, [3 * r3]
     lea r6, [3 * r1]
 
-.loop:
+.loop
     movu m0, [r2]
     movu xm1, [r2 + 32]
     movu [r0], m0
@@ -5085,7 +5085,7 @@
     pxor m4, m4
     pxor m5, m5
 
-.loop:
+.loop
     ; row 0
     movu m0, [r1]
     movu m1, [r1 + 16]
@@ -5196,7 +5196,7 @@
     pxor m4, m4
     pxor m5, m5
 
-.loop:
+.loop
     ; row 0
     movu m0, [r1]
     movu m1, [r1 + 16]
x265_2.7.tar.gz/source/common/x86/intrapred8.asm -> x265_2.6.tar.gz/source/common/x86/intrapred8.asm
Changed
@@ -2148,7 +2148,7 @@
     paddw m0, m1
     packuswb m0, m0
 
-    movd r2d, m0
+    movd r2, m0
     mov [r0], r2b
     shr r2, 8
     mov [r0 + r1], r2b
x265_2.7.tar.gz/source/common/x86/ipfilter16.asm -> x265_2.6.tar.gz/source/common/x86/ipfilter16.asm
Changed
@@ -47,10 +47,75 @@ SECTION_RODATA 32 +tab_c_32: times 8 dd 32 tab_c_524800: times 4 dd 524800 tab_c_n8192: times 8 dw -8192 pd_524800: times 8 dd 524800 +tab_Tm16: db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 + +tab_ChromaCoeff: dw 0, 64, 0, 0 + dw -2, 58, 10, -2 + dw -4, 54, 16, -2 + dw -6, 46, 28, -4 + dw -4, 36, 36, -4 + dw -4, 28, 46, -6 + dw -2, 16, 54, -4 + dw -2, 10, 58, -2 + +const tab_ChromaCoeffV, times 8 dw 0, 64 + times 8 dw 0, 0 + + times 8 dw -2, 58 + times 8 dw 10, -2 + + times 8 dw -4, 54 + times 8 dw 16, -2 + + times 8 dw -6, 46 + times 8 dw 28, -4 + + times 8 dw -4, 36 + times 8 dw 36, -4 + + times 8 dw -4, 28 + times 8 dw 46, -6 + + times 8 dw -2, 16 + times 8 dw 54, -4 + + times 8 dw -2, 10 + times 8 dw 58, -2 + +tab_ChromaCoeffVer: times 8 dw 0, 64 + times 8 dw 0, 0 + + times 8 dw -2, 58 + times 8 dw 10, -2 + + times 8 dw -4, 54 + times 8 dw 16, -2 + + times 8 dw -6, 46 + times 8 dw 28, -4 + + times 8 dw -4, 36 + times 8 dw 36, -4 + + times 8 dw -4, 28 + times 8 dw 46, -6 + + times 8 dw -2, 16 + times 8 dw 54, -4 + + times 8 dw -2, 10 + times 8 dw 58, -2 + +tab_LumaCoeff: dw 0, 0, 0, 64, 0, 0, 0, 0 + dw -1, 4, -10, 58, 17, -5, 1, 0 + dw -1, 4, -11, 40, 40, -11, 4, -1 + dw 0, 1, -5, 17, 58, -10, 4, -1 + ALIGN 32 tab_LumaCoeffV: times 4 dw 0, 0 times 4 dw 0, 64 @@ -92,6 +157,14 @@ times 8 dw 58, -10 times 8 dw 4, -1 +const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 + +const interp8_hpp_shuf, db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 + db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 + +const interp8_hpp_shuf_new, db 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9 + db 4, 5, 6, 7, 6, 7, 8, 9, 8, 9, 10, 11, 10, 11, 12, 13 + SECTION .text cextern pd_8 cextern pd_32 @@ -102,6 +175,255 @@ cextern pw_2000 cextern idct8_shuf2 +%macro FILTER_LUMA_HOR_4_sse2 1 + movu m4, [r0 + %1] ; m4 = src[0-7] + movu m5, [r0 + %1 + 2] ; m5 = src[1-8] + pmaddwd m4, m0 + pmaddwd m5, m0 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m2, m5, q2301 + 
paddd m5, m2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + + movu m5, [r0 + %1 + 4] ; m5 = src[2-9] + movu m3, [r0 + %1 + 6] ; m3 = src[3-10] + pmaddwd m5, m0 + pmaddwd m3, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m5, m5, q3120 + pshufd m3, m3, q3120 + punpcklqdq m5, m3 + + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + paddd m4, m1 +%endmacro + +%macro FILTER_LUMA_HOR_8_sse2 1 + movu m4, [r0 + %1] ; m4 = src[0-7] + movu m5, [r0 + %1 + 2] ; m5 = src[1-8] + pmaddwd m4, m0 + pmaddwd m5, m0 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + + movu m5, [r0 + %1 + 4] ; m5 = src[2-9] + movu m3, [r0 + %1 + 6] ; m3 = src[3-10] + pmaddwd m5, m0 + pmaddwd m3, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m5, m5, q3120 + pshufd m3, m3, q3120 + punpcklqdq m5, m3 + + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + paddd m4, m1 + + movu m5, [r0 + %1 + 8] ; m5 = src[4-11] + movu m6, [r0 + %1 + 10] ; m6 = src[5-12] + pmaddwd m5, m0 + pmaddwd m6, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m6, q2301 + paddd m6, m2 + pshufd m5, m5, q3120 + pshufd m6, m6, q3120 + punpcklqdq m5, m6 + + movu m6, [r0 + %1 + 12] ; m6 = src[6-13] + movu m3, [r0 + %1 + 14] ; m3 = src[7-14] + pmaddwd m6, m0 + pmaddwd m3, m0 + pshufd m2, m6, q2301 + paddd m6, m2 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m6, m6, q3120 + pshufd m3, m3, q3120 + punpcklqdq m6, m3 + + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m6, q2301 + paddd m6, m2 + pshufd m5, m5, q3120 + pshufd m6, m6, q3120 + punpcklqdq m5, m6 + paddd m5, m1 +%endmacro + 
+;------------------------------------------------------------------------------------------------------------ +; void interp_8tap_horiz_p%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------ +%macro FILTER_HOR_LUMA_sse2 3 +INIT_XMM sse2 +cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 + mov r4d, r4m + sub r0, 6 + shl r4d, 4 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r6, [tab_LumaCoeff] + mova m0, [r6 + r4] +%else + mova m0, [tab_LumaCoeff + r4] +%endif + +%ifidn %3, pp + mova m1, [pd_32] + pxor m7, m7 +%else + mova m1, [INTERP_OFFSET_PS] +%endif + + mov r4d, %2 +%ifidn %3, ps + cmp r5m, byte 0 + je .loopH + lea r6, [r1 + 2 * r1] + sub r0, r6 + add r4d, 7 +%endif + +.loopH: +%assign x 0 +%rep %1/8 + FILTER_LUMA_HOR_8_sse2 x + +%ifidn %3, pp + psrad m4, 6 + psrad m5, 6 + packssdw m4, m5 + CLIPW m4, m7, [pw_pixel_max] +%else + %if BIT_DEPTH == 10 + psrad m4, 2 + psrad m5, 2 + %elif BIT_DEPTH == 12 + psrad m4, 4 + psrad m5, 4 + %endif + packssdw m4, m5 +%endif + + movu [r2 + x], m4 +%assign x x+16 +%endrep + +%rep (%1 % 8)/4 + FILTER_LUMA_HOR_4_sse2 x + +%ifidn %3, pp + psrad m4, 6 + packssdw m4, m4 + CLIPW m4, m7, [pw_pixel_max] +%else + %if BIT_DEPTH == 10 + psrad m4, 2 + %elif BIT_DEPTH == 12 + psrad m4, 4 + %endif + packssdw m4, m4 +%endif + + movh [r2 + x], m4 +%endrep + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET + +%endmacro + +;------------------------------------------------------------------------------------------------------------ +; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------ + FILTER_HOR_LUMA_sse2 4, 4, pp + FILTER_HOR_LUMA_sse2 4, 8, pp + FILTER_HOR_LUMA_sse2 4, 16, pp + FILTER_HOR_LUMA_sse2 8, 4, pp + FILTER_HOR_LUMA_sse2 
8, 8, pp + FILTER_HOR_LUMA_sse2 8, 16, pp + FILTER_HOR_LUMA_sse2 8, 32, pp + FILTER_HOR_LUMA_sse2 12, 16, pp + FILTER_HOR_LUMA_sse2 16, 4, pp + FILTER_HOR_LUMA_sse2 16, 8, pp + FILTER_HOR_LUMA_sse2 16, 12, pp + FILTER_HOR_LUMA_sse2 16, 16, pp + FILTER_HOR_LUMA_sse2 16, 32, pp + FILTER_HOR_LUMA_sse2 16, 64, pp + FILTER_HOR_LUMA_sse2 24, 32, pp + FILTER_HOR_LUMA_sse2 32, 8, pp + FILTER_HOR_LUMA_sse2 32, 16, pp + FILTER_HOR_LUMA_sse2 32, 24, pp + FILTER_HOR_LUMA_sse2 32, 32, pp + FILTER_HOR_LUMA_sse2 32, 64, pp + FILTER_HOR_LUMA_sse2 48, 64, pp + FILTER_HOR_LUMA_sse2 64, 16, pp + FILTER_HOR_LUMA_sse2 64, 32, pp + FILTER_HOR_LUMA_sse2 64, 48, pp + FILTER_HOR_LUMA_sse2 64, 64, pp + +;--------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;--------------------------------------------------------------------------------------------------------------------------- + FILTER_HOR_LUMA_sse2 4, 4, ps + FILTER_HOR_LUMA_sse2 4, 8, ps + FILTER_HOR_LUMA_sse2 4, 16, ps + FILTER_HOR_LUMA_sse2 8, 4, ps + FILTER_HOR_LUMA_sse2 8, 8, ps + FILTER_HOR_LUMA_sse2 8, 16, ps + FILTER_HOR_LUMA_sse2 8, 32, ps + FILTER_HOR_LUMA_sse2 12, 16, ps + FILTER_HOR_LUMA_sse2 16, 4, ps + FILTER_HOR_LUMA_sse2 16, 8, ps + FILTER_HOR_LUMA_sse2 16, 12, ps + FILTER_HOR_LUMA_sse2 16, 16, ps + FILTER_HOR_LUMA_sse2 16, 32, ps + FILTER_HOR_LUMA_sse2 16, 64, ps + FILTER_HOR_LUMA_sse2 24, 32, ps + FILTER_HOR_LUMA_sse2 32, 8, ps + FILTER_HOR_LUMA_sse2 32, 16, ps + FILTER_HOR_LUMA_sse2 32, 24, ps + FILTER_HOR_LUMA_sse2 32, 32, ps + FILTER_HOR_LUMA_sse2 32, 64, ps + FILTER_HOR_LUMA_sse2 48, 64, ps + FILTER_HOR_LUMA_sse2 64, 16, ps + FILTER_HOR_LUMA_sse2 64, 32, ps + FILTER_HOR_LUMA_sse2 64, 48, ps + FILTER_HOR_LUMA_sse2 64, 64, ps + %macro PROCESS_LUMA_VER_W4_4R_sse2 0 movq m0, [r0] movq m1, [r0 + r1] @@ -301,6 +623,5270 @@ 
FILTER_VER_LUMA_sse2 ps, 64, 16 FILTER_VER_LUMA_sse2 ps, 16, 64 +%macro FILTERH_W2_4_sse3 2 + movh m3, [r0 + %1] + movhps m3, [r0 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + r1 + %1] + movhps m4, [r0 + r1 + %1 + 2] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + movh m5, [r0 + 2 * r1 + %1] + movhps m5, [r0 + 2 * r1 + %1 + 2] + pmaddwd m5, m0 + movh m4, [r0 + r4 + %1] + movhps m4, [r0 + r4 + %1 + 2] + pmaddwd m4, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m5, m5, q3120 + pshufd m4, m4, q3120 + punpcklqdq m5, m4 + paddd m5, m1 +%ifidn %2, pp + psrad m3, 6 + psrad m5, 6 + packssdw m3, m5 + CLIPW m3, m7, m6 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movd [r2 + %1], m3 + psrldq m3, 4 + movd [r2 + r3 + %1], m3 + psrldq m3, 4 + movd [r2 + r3 * 2 + %1], m3 + psrldq m3, 4 + movd [r2 + r5 + %1], m3 +%endmacro + +%macro FILTERH_W2_3_sse3 1 + movh m3, [r0 + %1] + movhps m3, [r0 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + r1 + %1] + movhps m4, [r0 + r1 + %1 + 2] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + + movh m5, [r0 + 2 * r1 + %1] + movhps m5, [r0 + 2 * r1 + %1 + 2] + pmaddwd m5, m0 + + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m5, m5, q3120 + paddd m5, m1 + + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 + + movd [r2 + %1], m3 + psrldq m3, 4 + movd [r2 + r3 + %1], m3 + psrldq m3, 4 + movd [r2 + r3 * 2 + %1], m3 +%endmacro + +%macro FILTERH_W4_2_sse3 2 + movh m3, [r0 + %1] + movhps m3, [r0 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + %1 + 4] + movhps m4, [r0 + %1 + 6] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, 
q3120 + punpcklqdq m3, m4 + paddd m3, m1 + + movh m5, [r0 + r1 + %1] + movhps m5, [r0 + r1 + %1 + 2] + pmaddwd m5, m0 + movh m4, [r0 + r1 + %1 + 4] + movhps m4, [r0 + r1 + %1 + 6] + pmaddwd m4, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m5, m5, q3120 + pshufd m4, m4, q3120 + punpcklqdq m5, m4 + paddd m5, m1 +%ifidn %2, pp + psrad m3, 6 + psrad m5, 6 + packssdw m3, m5 + CLIPW m3, m7, m6 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + %1], m3 + movhps [r2 + r3 + %1], m3 +%endmacro + +%macro FILTERH_W4_1_sse3 1 + movh m3, [r0 + 2 * r1 + %1] + movhps m3, [r0 + 2 * r1 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + 2 * r1 + %1 + 4] + movhps m4, [r0 + 2 * r1 + %1 + 6] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + + psrad m3, INTERP_SHIFT_PS + packssdw m3, m3 + movh [r2 + r3 * 2 + %1], m3 +%endmacro + +%macro FILTERH_W8_1_sse3 2 + movh m3, [r0 + %1] + movhps m3, [r0 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + %1 + 4] + movhps m4, [r0 + %1 + 6] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + + movh m5, [r0 + %1 + 8] + movhps m5, [r0 + %1 + 10] + pmaddwd m5, m0 + movh m4, [r0 + %1 + 12] + movhps m4, [r0 + %1 + 14] + pmaddwd m4, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m5, m5, q3120 + pshufd m4, m4, q3120 + punpcklqdq m5, m4 + paddd m5, m1 +%ifidn %2, pp + psrad m3, 6 + psrad m5, 6 + packssdw m3, m5 + CLIPW m3, m7, m6 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movdqu [r2 + %1], m3 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t 
srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_HOR_CHROMA_sse3 3 +INIT_XMM sse3 +cglobal interp_4tap_horiz_%3_%1x%2, 4, 7, 8 + add r3, r3 + add r1, r1 + sub r0, 2 + mov r4d, r4m + add r4d, r4d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + movddup m0, [r6 + r4 * 4] +%else + movddup m0, [tab_ChromaCoeff + r4 * 4] +%endif + +%ifidn %3, ps + mova m1, [INTERP_OFFSET_PS] + cmp r5m, byte 0 +%if %1 <= 6 + lea r4, [r1 * 3] + lea r5, [r3 * 3] +%endif + je .skip + sub r0, r1 +%if %1 <= 6 +%assign y 1 +%else +%assign y 3 +%endif +%assign z 0 +%rep y +%assign x 0 +%rep %1/8 + FILTERH_W8_1_sse3 x, %3 +%assign x x+16 +%endrep +%if %1 == 4 || (%1 == 6 && z == 0) || (%1 == 12 && z == 0) + FILTERH_W4_2_sse3 x, %3 + FILTERH_W4_1_sse3 x +%assign x x+8 +%endif +%if %1 == 2 || (%1 == 6 && z == 0) + FILTERH_W2_3_sse3 x +%endif +%if %1 <= 6 + lea r0, [r0 + r4] + lea r2, [r2 + r5] +%else + lea r0, [r0 + r1] + lea r2, [r2 + r3] +%endif +%assign z z+1 +%endrep +.skip: +%elifidn %3, pp + pxor m7, m7 + mova m6, [pw_pixel_max] + mova m1, [tab_c_32] +%if %1 == 2 || %1 == 6 + lea r4, [r1 * 3] + lea r5, [r3 * 3] +%endif +%endif + +%if %1 == 2 +%assign y %2/4 +%elif %1 <= 6 +%assign y %2/2 +%else +%assign y %2 +%endif +%assign z 0 +%rep y +%assign x 0 +%rep %1/8 + FILTERH_W8_1_sse3 x, %3 +%assign x x+16 +%endrep +%if %1 == 4 || %1 == 6 || (%1 == 12 && (z % 2) == 0) + FILTERH_W4_2_sse3 x, %3 +%assign x x+8 +%endif +%if %1 == 2 || (%1 == 6 && (z % 2) == 0) + FILTERH_W2_4_sse3 x, %3 +%endif +%assign z z+1 +%if z < y +%if %1 == 2 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%elif %1 <= 6 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%else + lea r0, [r0 + r1] + lea r2, [r2 + r3] +%endif +%endif ;z < y +%endrep + + RET +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t 
dstStride, int coeffIdx) +;----------------------------------------------------------------------------- + +FILTER_HOR_CHROMA_sse3 2, 4, pp +FILTER_HOR_CHROMA_sse3 2, 8, pp +FILTER_HOR_CHROMA_sse3 2, 16, pp +FILTER_HOR_CHROMA_sse3 4, 2, pp +FILTER_HOR_CHROMA_sse3 4, 4, pp +FILTER_HOR_CHROMA_sse3 4, 8, pp +FILTER_HOR_CHROMA_sse3 4, 16, pp +FILTER_HOR_CHROMA_sse3 4, 32, pp +FILTER_HOR_CHROMA_sse3 6, 8, pp +FILTER_HOR_CHROMA_sse3 6, 16, pp +FILTER_HOR_CHROMA_sse3 8, 2, pp +FILTER_HOR_CHROMA_sse3 8, 4, pp +FILTER_HOR_CHROMA_sse3 8, 6, pp +FILTER_HOR_CHROMA_sse3 8, 8, pp +FILTER_HOR_CHROMA_sse3 8, 12, pp +FILTER_HOR_CHROMA_sse3 8, 16, pp +FILTER_HOR_CHROMA_sse3 8, 32, pp +FILTER_HOR_CHROMA_sse3 8, 64, pp +FILTER_HOR_CHROMA_sse3 12, 16, pp +FILTER_HOR_CHROMA_sse3 12, 32, pp +FILTER_HOR_CHROMA_sse3 16, 4, pp +FILTER_HOR_CHROMA_sse3 16, 8, pp +FILTER_HOR_CHROMA_sse3 16, 12, pp +FILTER_HOR_CHROMA_sse3 16, 16, pp +FILTER_HOR_CHROMA_sse3 16, 24, pp +FILTER_HOR_CHROMA_sse3 16, 32, pp +FILTER_HOR_CHROMA_sse3 16, 64, pp +FILTER_HOR_CHROMA_sse3 24, 32, pp +FILTER_HOR_CHROMA_sse3 24, 64, pp +FILTER_HOR_CHROMA_sse3 32, 8, pp +FILTER_HOR_CHROMA_sse3 32, 16, pp +FILTER_HOR_CHROMA_sse3 32, 24, pp +FILTER_HOR_CHROMA_sse3 32, 32, pp +FILTER_HOR_CHROMA_sse3 32, 48, pp +FILTER_HOR_CHROMA_sse3 32, 64, pp +FILTER_HOR_CHROMA_sse3 48, 64, pp +FILTER_HOR_CHROMA_sse3 64, 16, pp +FILTER_HOR_CHROMA_sse3 64, 32, pp +FILTER_HOR_CHROMA_sse3 64, 48, pp +FILTER_HOR_CHROMA_sse3 64, 64, pp + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- + +FILTER_HOR_CHROMA_sse3 2, 4, ps +FILTER_HOR_CHROMA_sse3 2, 8, ps +FILTER_HOR_CHROMA_sse3 2, 16, ps +FILTER_HOR_CHROMA_sse3 4, 2, ps +FILTER_HOR_CHROMA_sse3 4, 4, ps +FILTER_HOR_CHROMA_sse3 4, 8, ps +FILTER_HOR_CHROMA_sse3 4, 16, ps 
+FILTER_HOR_CHROMA_sse3 4, 32, ps +FILTER_HOR_CHROMA_sse3 6, 8, ps +FILTER_HOR_CHROMA_sse3 6, 16, ps +FILTER_HOR_CHROMA_sse3 8, 2, ps +FILTER_HOR_CHROMA_sse3 8, 4, ps +FILTER_HOR_CHROMA_sse3 8, 6, ps +FILTER_HOR_CHROMA_sse3 8, 8, ps +FILTER_HOR_CHROMA_sse3 8, 12, ps +FILTER_HOR_CHROMA_sse3 8, 16, ps +FILTER_HOR_CHROMA_sse3 8, 32, ps +FILTER_HOR_CHROMA_sse3 8, 64, ps +FILTER_HOR_CHROMA_sse3 12, 16, ps +FILTER_HOR_CHROMA_sse3 12, 32, ps +FILTER_HOR_CHROMA_sse3 16, 4, ps +FILTER_HOR_CHROMA_sse3 16, 8, ps +FILTER_HOR_CHROMA_sse3 16, 12, ps +FILTER_HOR_CHROMA_sse3 16, 16, ps +FILTER_HOR_CHROMA_sse3 16, 24, ps +FILTER_HOR_CHROMA_sse3 16, 32, ps +FILTER_HOR_CHROMA_sse3 16, 64, ps +FILTER_HOR_CHROMA_sse3 24, 32, ps +FILTER_HOR_CHROMA_sse3 24, 64, ps +FILTER_HOR_CHROMA_sse3 32, 8, ps +FILTER_HOR_CHROMA_sse3 32, 16, ps +FILTER_HOR_CHROMA_sse3 32, 24, ps +FILTER_HOR_CHROMA_sse3 32, 32, ps +FILTER_HOR_CHROMA_sse3 32, 48, ps +FILTER_HOR_CHROMA_sse3 32, 64, ps +FILTER_HOR_CHROMA_sse3 48, 64, ps +FILTER_HOR_CHROMA_sse3 64, 16, ps +FILTER_HOR_CHROMA_sse3 64, 32, ps +FILTER_HOR_CHROMA_sse3 64, 48, ps +FILTER_HOR_CHROMA_sse3 64, 64, ps + +%macro FILTER_P2S_2_4_sse2 1 + movd m0, [r0 + %1] + movd m2, [r0 + r1 * 2 + %1] + movhps m0, [r0 + r1 + %1] + movhps m2, [r0 + r4 + %1] + psllw m0, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m1 + psubw m2, m1 + + movd [r2 + r3 * 0 + %1], m0 + movd [r2 + r3 * 2 + %1], m2 + movhlps m0, m0 + movhlps m2, m2 + movd [r2 + r3 * 1 + %1], m0 + movd [r2 + r5 + %1], m2 +%endmacro + +%macro FILTER_P2S_4_4_sse2 1 + movh m0, [r0 + %1] + movhps m0, [r0 + r1 + %1] + psllw m0, (14 - BIT_DEPTH) + psubw m0, m1 + movh [r2 + r3 * 0 + %1], m0 + movhps [r2 + r3 * 1 + %1], m0 + + movh m2, [r0 + r1 * 2 + %1] + movhps m2, [r0 + r4 + %1] + psllw m2, (14 - BIT_DEPTH) + psubw m2, m1 + movh [r2 + r3 * 2 + %1], m2 + movhps [r2 + r5 + %1], m2 +%endmacro + +%macro FILTER_P2S_4_2_sse2 0 + movh m0, [r0] + movhps m0, [r0 + r1 * 2] + psllw m0, (14 - BIT_DEPTH) + psubw 
m0, [pw_2000] + movh [r2 + r3 * 0], m0 + movhps [r2 + r3 * 2], m0 +%endmacro + +%macro FILTER_P2S_8_4_sse2 1 + movu m0, [r0 + %1] + movu m2, [r0 + r1 + %1] + psllw m0, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m1 + psubw m2, m1 + movu [r2 + r3 * 0 + %1], m0 + movu [r2 + r3 * 1 + %1], m2 + + movu m3, [r0 + r1 * 2 + %1] + movu m4, [r0 + r4 + %1] + psllw m3, (14 - BIT_DEPTH) + psllw m4, (14 - BIT_DEPTH) + psubw m3, m1 + psubw m4, m1 + movu [r2 + r3 * 2 + %1], m3 + movu [r2 + r5 + %1], m4 +%endmacro + +%macro FILTER_P2S_8_2_sse2 1 + movu m0, [r0 + %1] + movu m2, [r0 + r1 + %1] + psllw m0, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m1 + psubw m2, m1 + movu [r2 + r3 * 0 + %1], m0 + movu [r2 + r3 * 1 + %1], m2 +%endmacro + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro FILTER_PIX_TO_SHORT_sse2 2 +INIT_XMM sse2 +cglobal filterPixelToShort_%1x%2, 4, 6, 3 +%if %2 == 2 +%if %1 == 4 + FILTER_P2S_4_2_sse2 +%elif %1 == 8 + add r1d, r1d + add r3d, r3d + mova m1, [pw_2000] + FILTER_P2S_8_2_sse2 0 +%endif +%else + add r1d, r1d + add r3d, r3d + mova m1, [pw_2000] + lea r4, [r1 * 3] + lea r5, [r3 * 3] +%assign y 1 +%rep %2/4 +%assign x 0 +%rep %1/8 + FILTER_P2S_8_4_sse2 x +%if %2 == 6 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + FILTER_P2S_8_2_sse2 x +%endif +%assign x x+16 +%endrep +%rep (%1 % 8)/4 + FILTER_P2S_4_4_sse2 x +%assign x x+8 +%endrep +%rep (%1 % 4)/2 + FILTER_P2S_2_4_sse2 x +%endrep +%if y < %2/4 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%assign y y+1 +%endif +%endrep +%endif +RET +%endmacro + + FILTER_PIX_TO_SHORT_sse2 2, 4 + FILTER_PIX_TO_SHORT_sse2 2, 8 + FILTER_PIX_TO_SHORT_sse2 2, 16 + FILTER_PIX_TO_SHORT_sse2 4, 2 + FILTER_PIX_TO_SHORT_sse2 4, 4 + FILTER_PIX_TO_SHORT_sse2 4, 8 + FILTER_PIX_TO_SHORT_sse2 4, 16 + 
FILTER_PIX_TO_SHORT_sse2 4, 32 + FILTER_PIX_TO_SHORT_sse2 6, 8 + FILTER_PIX_TO_SHORT_sse2 6, 16 + FILTER_PIX_TO_SHORT_sse2 8, 2 + FILTER_PIX_TO_SHORT_sse2 8, 4 + FILTER_PIX_TO_SHORT_sse2 8, 6 + FILTER_PIX_TO_SHORT_sse2 8, 8 + FILTER_PIX_TO_SHORT_sse2 8, 12 + FILTER_PIX_TO_SHORT_sse2 8, 16 + FILTER_PIX_TO_SHORT_sse2 8, 32 + FILTER_PIX_TO_SHORT_sse2 8, 64 + FILTER_PIX_TO_SHORT_sse2 12, 16 + FILTER_PIX_TO_SHORT_sse2 12, 32 + FILTER_PIX_TO_SHORT_sse2 16, 4 + FILTER_PIX_TO_SHORT_sse2 16, 8 + FILTER_PIX_TO_SHORT_sse2 16, 12 + FILTER_PIX_TO_SHORT_sse2 16, 16 + FILTER_PIX_TO_SHORT_sse2 16, 24 + FILTER_PIX_TO_SHORT_sse2 16, 32 + FILTER_PIX_TO_SHORT_sse2 16, 64 + FILTER_PIX_TO_SHORT_sse2 24, 32 + FILTER_PIX_TO_SHORT_sse2 24, 64 + FILTER_PIX_TO_SHORT_sse2 32, 8 + FILTER_PIX_TO_SHORT_sse2 32, 16 + FILTER_PIX_TO_SHORT_sse2 32, 24 + FILTER_PIX_TO_SHORT_sse2 32, 32 + FILTER_PIX_TO_SHORT_sse2 32, 48 + FILTER_PIX_TO_SHORT_sse2 32, 64 + FILTER_PIX_TO_SHORT_sse2 48, 64 + FILTER_PIX_TO_SHORT_sse2 64, 16 + FILTER_PIX_TO_SHORT_sse2 64, 32 + FILTER_PIX_TO_SHORT_sse2 64, 48 + FILTER_PIX_TO_SHORT_sse2 64, 64 + +;------------------------------------------------------------------------------------------------------------ +; void interp_8tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------ +%macro FILTER_HOR_LUMA_W4 3 +INIT_XMM sse4 +cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 + mov r4d, r4m + sub r0, 6 + shl r4d, 4 + add r1, r1 + add r3, r3 + +%ifdef PIC + lea r6, [tab_LumaCoeff] + mova m0, [r6 + r4] +%else + mova m0, [tab_LumaCoeff + r4] +%endif + +%ifidn %3, pp + mova m1, [pd_32] + pxor m6, m6 + mova m7, [pw_pixel_max] +%else + mova m1, [INTERP_OFFSET_PS] +%endif + + mov r4d, %2 +%ifidn %3, ps + cmp r5m, byte 0 + je .loopH + lea r6, [r1 + 2 * r1] + sub r0, r6 + add r4d, 7 +%endif + +.loopH: + movu m2, [r0] ; m2 = src[0-7] + movu m3, [r0 + 
16] ; m3 = src[8-15] + + pmaddwd m4, m2, m0 + palignr m5, m3, m2, 2 ; m5 = src[1-8] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m3, m2, 4 ; m5 = src[2-9] + pmaddwd m5, m0 + palignr m3, m2, 6 ; m3 = src[3-10] + pmaddwd m3, m0 + phaddd m5, m3 + + phaddd m4, m5 + paddd m4, m1 +%ifidn %3, pp + psrad m4, 6 + packusdw m4, m4 + CLIPW m4, m6, m7 +%else + psrad m4, INTERP_SHIFT_PS + packssdw m4, m4 +%endif + + movh [r2], m4 + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET +%endmacro + +;------------------------------------------------------------------------------------------------------------ +; void interp_8tap_horiz_pp_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------ +FILTER_HOR_LUMA_W4 4, 4, pp +FILTER_HOR_LUMA_W4 4, 8, pp +FILTER_HOR_LUMA_W4 4, 16, pp + +;--------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;--------------------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W4 4, 4, ps +FILTER_HOR_LUMA_W4 4, 8, ps +FILTER_HOR_LUMA_W4 4, 16, ps + +;------------------------------------------------------------------------------------------------------------ +; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------ +%macro FILTER_HOR_LUMA_W8 3 +INIT_XMM sse4 +cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 + + add r1, r1 + add r3, r3 + mov r4d, r4m + sub r0, 6 + shl r4d, 4 + +%ifdef PIC + lea r6, [tab_LumaCoeff] + mova m0, [r6 + r4] +%else + mova m0, [tab_LumaCoeff + r4] +%endif + +%ifidn 
%3, pp + mova m1, [pd_32] + pxor m7, m7 +%else + mova m1, [INTERP_OFFSET_PS] +%endif + + mov r4d, %2 +%ifidn %3, ps + cmp r5m, byte 0 + je .loopH + lea r6, [r1 + 2 * r1] + sub r0, r6 + add r4d, 7 +%endif + +.loopH: + movu m2, [r0] ; m2 = src[0-7] + movu m3, [r0 + 16] ; m3 = src[8-15] + + pmaddwd m4, m2, m0 + palignr m5, m3, m2, 2 ; m5 = src[1-8] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m3, m2, 4 ; m5 = src[2-9] + pmaddwd m5, m0 + palignr m6, m3, m2, 6 ; m6 = src[3-10] + pmaddwd m6, m0 + phaddd m5, m6 + phaddd m4, m5 + paddd m4, m1 + + palignr m5, m3, m2, 8 ; m5 = src[4-11] + pmaddwd m5, m0 + palignr m6, m3, m2, 10 ; m6 = src[5-12] + pmaddwd m6, m0 + phaddd m5, m6 + + palignr m6, m3, m2, 12 ; m6 = src[6-13] + pmaddwd m6, m0 + palignr m3, m2, 14 ; m3 = src[7-14] + pmaddwd m3, m0 + phaddd m6, m3 + phaddd m5, m6 + paddd m5, m1 +%ifidn %3, pp + psrad m4, 6 + psrad m5, 6 + packusdw m4, m5 + CLIPW m4, m7, [pw_pixel_max] +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m4, m5 +%endif + + movu [r2], m4 + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET +%endmacro + +;------------------------------------------------------------------------------------------------------------ +; void interp_8tap_horiz_pp_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------ +FILTER_HOR_LUMA_W8 8, 4, pp +FILTER_HOR_LUMA_W8 8, 8, pp +FILTER_HOR_LUMA_W8 8, 16, pp +FILTER_HOR_LUMA_W8 8, 32, pp + +;--------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;--------------------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W8 8, 4, ps +FILTER_HOR_LUMA_W8 8, 8, 
ps +FILTER_HOR_LUMA_W8 8, 16, ps +FILTER_HOR_LUMA_W8 8, 32, ps + +;-------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W12 3 +INIT_XMM sse4 +cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 + + add r1, r1 + add r3, r3 + mov r4d, r4m + sub r0, 6 + shl r4d, 4 + +%ifdef PIC + lea r6, [tab_LumaCoeff] + mova m0, [r6 + r4] +%else + mova m0, [tab_LumaCoeff + r4] +%endif +%ifidn %3, pp + mova m1, [INTERP_OFFSET_PP] +%else + mova m1, [INTERP_OFFSET_PS] +%endif + + mov r4d, %2 +%ifidn %3, ps + cmp r5m, byte 0 + je .loopH + lea r6, [r1 + 2 * r1] + sub r0, r6 + add r4d, 7 +%endif + +.loopH: + movu m2, [r0] ; m2 = src[0-7] + movu m3, [r0 + 16] ; m3 = src[8-15] + + pmaddwd m4, m2, m0 + palignr m5, m3, m2, 2 ; m5 = src[1-8] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m3, m2, 4 ; m5 = src[2-9] + pmaddwd m5, m0 + palignr m6, m3, m2, 6 ; m6 = src[3-10] + pmaddwd m6, m0 + phaddd m5, m6 + phaddd m4, m5 + paddd m4, m1 + + palignr m5, m3, m2, 8 ; m5 = src[4-11] + pmaddwd m5, m0 + palignr m6, m3, m2, 10 ; m6 = src[5-12] + pmaddwd m6, m0 + phaddd m5, m6 + + palignr m6, m3, m2, 12 ; m6 = src[6-13] + pmaddwd m6, m0 + palignr m7, m3, m2, 14 ; m2 = src[7-14] + pmaddwd m7, m0 + phaddd m6, m7 + phaddd m5, m6 + paddd m5, m1 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m4, m5 + pxor m5, m5 + CLIPW m4, m5, [pw_pixel_max] +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m4, m5 +%endif + + movu [r2], m4 + + movu m2, [r0 + 32] ; m2 = src[16-23] + + pmaddwd m4, m3, m0 ; m3 = src[8-15] + palignr m5, m2, m3, 2 ; m5 = src[9-16] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m2, m3, 4 ; m5 = src[10-17] + pmaddwd m5, m0 + palignr m2, m3, 6 
; m2 = src[11-18] + pmaddwd m2, m0 + phaddd m5, m2 + phaddd m4, m5 + paddd m4, m1 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + packusdw m4, m4 + pxor m5, m5 + CLIPW m4, m5, [pw_pixel_max] +%else + psrad m4, INTERP_SHIFT_PS + packssdw m4, m4 +%endif + + movh [r2 + 16], m4 + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W12 12, 16, pp + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W12 12, 16, ps + +;-------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W16 3 +INIT_XMM sse4 +cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 + + add r1, r1 + add r3, r3 + mov r4d, r4m + sub r0, 6 + shl r4d, 4 + +%ifdef PIC + lea r6, [tab_LumaCoeff] + mova m0, [r6 + r4] +%else + mova m0, [tab_LumaCoeff + r4] +%endif + +%ifidn %3, pp + mova m1, [pd_32] +%else + mova m1, [INTERP_OFFSET_PS] +%endif + + mov r4d, %2 +%ifidn %3, ps + cmp r5m, byte 0 + je .loopH + lea r6, [r1 + 2 * r1] + sub r0, r6 + add r4d, 7 +%endif + +.loopH: +%assign x 0 +%rep %1 / 16 + movu m2, [r0 + x] ; m2 = 
src[0-7] + movu m3, [r0 + 16 + x] ; m3 = src[8-15] + + pmaddwd m4, m2, m0 + palignr m5, m3, m2, 2 ; m5 = src[1-8] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m3, m2, 4 ; m5 = src[2-9] + pmaddwd m5, m0 + palignr m6, m3, m2, 6 ; m6 = src[3-10] + pmaddwd m6, m0 + phaddd m5, m6 + phaddd m4, m5 + paddd m4, m1 + + palignr m5, m3, m2, 8 ; m5 = src[4-11] + pmaddwd m5, m0 + palignr m6, m3, m2, 10 ; m6 = src[5-12] + pmaddwd m6, m0 + phaddd m5, m6 + + palignr m6, m3, m2, 12 ; m6 = src[6-13] + pmaddwd m6, m0 + palignr m7, m3, m2, 14 ; m7 = src[7-14] + pmaddwd m7, m0 + phaddd m6, m7 + phaddd m5, m6 + paddd m5, m1 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m4, m5 + pxor m5, m5 + CLIPW m4, m5, [pw_pixel_max] +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m4, m5 +%endif + movu [r2 + x], m4 + + movu m2, [r0 + 32 + x] ; m2 = src[16-23] + + pmaddwd m4, m3, m0 ; m3 = src[8-15] + palignr m5, m2, m3, 2 ; m5 = src[9-16] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m2, m3, 4 ; m5 = src[10-17] + pmaddwd m5, m0 + palignr m6, m2, m3, 6 ; m6 = src[11-18] + pmaddwd m6, m0 + phaddd m5, m6 + phaddd m4, m5 + paddd m4, m1 + + palignr m5, m2, m3, 8 ; m5 = src[12-19] + pmaddwd m5, m0 + palignr m6, m2, m3, 10 ; m6 = src[13-20] + pmaddwd m6, m0 + phaddd m5, m6 + + palignr m6, m2, m3, 12 ; m6 = src[14-21] + pmaddwd m6, m0 + palignr m2, m3, 14 ; m2 = src[15-22] + pmaddwd m2, m0 + phaddd m6, m2 + phaddd m5, m6 + paddd m5, m1 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m4, m5 + pxor m5, m5 + CLIPW m4, m5, [pw_pixel_max] +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m4, m5 +%endif + movu [r2 + 16 + x], m4 + +%assign x x+32 +%endrep + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_16x%2(pixel *src, intptr_t 
srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W16 16, 4, pp +FILTER_HOR_LUMA_W16 16, 8, pp +FILTER_HOR_LUMA_W16 16, 12, pp +FILTER_HOR_LUMA_W16 16, 16, pp +FILTER_HOR_LUMA_W16 16, 32, pp +FILTER_HOR_LUMA_W16 16, 64, pp + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W16 16, 4, ps +FILTER_HOR_LUMA_W16 16, 8, ps +FILTER_HOR_LUMA_W16 16, 12, ps +FILTER_HOR_LUMA_W16 16, 16, ps +FILTER_HOR_LUMA_W16 16, 32, ps +FILTER_HOR_LUMA_W16 16, 64, ps + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W16 32, 8, pp +FILTER_HOR_LUMA_W16 32, 16, pp +FILTER_HOR_LUMA_W16 32, 24, pp +FILTER_HOR_LUMA_W16 32, 32, pp +FILTER_HOR_LUMA_W16 32, 64, pp + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W16 32, 8, ps +FILTER_HOR_LUMA_W16 32, 16, ps +FILTER_HOR_LUMA_W16 32, 24, ps +FILTER_HOR_LUMA_W16 32, 32, ps +FILTER_HOR_LUMA_W16 32, 64, ps + 
+;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_48x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W16 48, 64, pp + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_48x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W16 48, 64, ps + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_64x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W16 64, 16, pp +FILTER_HOR_LUMA_W16 64, 32, pp +FILTER_HOR_LUMA_W16 64, 48, pp +FILTER_HOR_LUMA_W16 64, 64, pp + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_64x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W16 64, 16, ps +FILTER_HOR_LUMA_W16 64, 32, ps +FILTER_HOR_LUMA_W16 64, 48, ps +FILTER_HOR_LUMA_W16 64, 64, ps + +;-------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
+;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W24 3 +INIT_XMM sse4 +cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 + + add r1, r1 + add r3, r3 + mov r4d, r4m + sub r0, 6 + shl r4d, 4 + +%ifdef PIC + lea r6, [tab_LumaCoeff] + mova m0, [r6 + r4] +%else + mova m0, [tab_LumaCoeff + r4] +%endif +%ifidn %3, pp + mova m1, [pd_32] +%else + mova m1, [INTERP_OFFSET_PS] +%endif + + mov r4d, %2 +%ifidn %3, ps + cmp r5m, byte 0 + je .loopH + lea r6, [r1 + 2 * r1] + sub r0, r6 + add r4d, 7 +%endif + +.loopH: + movu m2, [r0] ; m2 = src[0-7] + movu m3, [r0 + 16] ; m3 = src[8-15] + + pmaddwd m4, m2, m0 + palignr m5, m3, m2, 2 ; m5 = src[1-8] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m3, m2, 4 ; m5 = src[2-9] + pmaddwd m5, m0 + palignr m6, m3, m2, 6 ; m6 = src[3-10] + pmaddwd m6, m0 + phaddd m5, m6 + phaddd m4, m5 + paddd m4, m1 + + palignr m5, m3, m2, 8 ; m5 = src[4-11] + pmaddwd m5, m0 + palignr m6, m3, m2, 10 ; m6 = src[5-12] + pmaddwd m6, m0 + phaddd m5, m6 + + palignr m6, m3, m2, 12 ; m6 = src[6-13] + pmaddwd m6, m0 + palignr m7, m3, m2, 14 ; m7 = src[7-14] + pmaddwd m7, m0 + phaddd m6, m7 + phaddd m5, m6 + paddd m5, m1 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m4, m5 + pxor m5, m5 + CLIPW m4, m5, [pw_pixel_max] +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m4, m5 +%endif + movu [r2], m4 + + movu m2, [r0 + 32] ; m2 = src[16-23] + + pmaddwd m4, m3, m0 ; m3 = src[8-15] + palignr m5, m2, m3, 2 ; m5 = src[9-16] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m2, m3, 4 ; m5 = src[10-17] + pmaddwd m5, m0 + palignr m6, m2, m3, 6 ; m6 = src[11-18] + pmaddwd m6, m0 + phaddd m5, m6 + phaddd m4, m5 + paddd m4, m1 + + palignr m5, m2, m3, 8 ; m5 = src[12-19] + pmaddwd m5, m0 + palignr m6, m2, m3, 10 ; m6 = src[13-20] + pmaddwd m6, m0 + phaddd m5, m6 + + palignr m6, m2, m3, 12 ; m6 = src[14-21] + pmaddwd m6, m0 + palignr m7, m2, m3, 14 ; 
m7 = src[15-22] + pmaddwd m7, m0 + phaddd m6, m7 + phaddd m5, m6 + paddd m5, m1 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m4, m5 + pxor m5, m5 + CLIPW m4, m5, [pw_pixel_max] +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m4, m5 +%endif + movu [r2 + 16], m4 + + movu m3, [r0 + 48] ; m3 = src[24-31] + + pmaddwd m4, m2, m0 ; m2 = src[16-23] + palignr m5, m3, m2, 2 ; m5 = src[17-24] + pmaddwd m5, m0 + phaddd m4, m5 + + palignr m5, m3, m2, 4 ; m5 = src[18-25] + pmaddwd m5, m0 + palignr m6, m3, m2, 6 ; m6 = src[19-26] + pmaddwd m6, m0 + phaddd m5, m6 + phaddd m4, m5 + paddd m4, m1 + + palignr m5, m3, m2, 8 ; m5 = src[20-27] + pmaddwd m5, m0 + palignr m6, m3, m2, 10 ; m6 = src[21-28] + pmaddwd m6, m0 + phaddd m5, m6 + + palignr m6, m3, m2, 12 ; m6 = src[22-29] + pmaddwd m6, m0 + palignr m7, m3, m2, 14 ; m7 = src[23-30] + pmaddwd m7, m0 + phaddd m6, m7 + phaddd m5, m6 + paddd m5, m1 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m4, m5 + pxor m5, m5 + CLIPW m4, m5, [pw_pixel_max] +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m4, m5 +%endif + movu [r2 + 32], m4 + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_24x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W24 24, 32, pp + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_24x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) 
+;---------------------------------------------------------------------------------------------------------------------------- +FILTER_HOR_LUMA_W24 24, 32, ps + +%macro FILTER_W2_2 1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + r1] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + packusdw m3, m3 + CLIPW m3, m7, m6 +%else + psrad m3, INTERP_SHIFT_PS + packssdw m3, m3 +%endif + movd [r2], m3 + pextrd [r2 + r3], m3, 1 +%endmacro + +%macro FILTER_W4_2 1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + r1] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + r1 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m7, m6 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2], m3 + movhps [r2 + r3], m3 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W4_avx2 1 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_4x%1, 4,7,7 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + lea r6, [pw_pixel_max] + mova m3, [interp8_hpp_shuf] + mova m6, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, %1/2 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + 
pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + phaddd m4, m4 + vpermq m4, m4, q3120 + paddd m4, m6 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [r6] + movq [r2], xm4 + + vbroadcasti128 m4, [r0 + r1] + vbroadcasti128 m5, [r0 + r1 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + phaddd m4, m4 + vpermq m4, m4, q3120 + paddd m4, m6 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [r6] + movq [r2 + r3], xm4 + + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loop + RET +%endmacro +FILTER_HOR_LUMA_W4_avx2 4 +FILTER_HOR_LUMA_W4_avx2 8 +FILTER_HOR_LUMA_W4_avx2 16 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W8 1 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_8x%1, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, %1/2 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8] + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, 
[pw_pixel_max] + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + r1] + vbroadcasti128 m5, [r0 + r1 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + r1 + 8] + vbroadcasti128 m6, [r0 + r1 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + r3], xm4 + + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loop + RET +%endmacro +FILTER_HOR_LUMA_W8 4 +FILTER_HOR_LUMA_W8 8 +FILTER_HOR_LUMA_W8 16 +FILTER_HOR_LUMA_W8 32 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W16 1 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_16x%1, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, %1 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8] + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2], xm4 + + 
vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24] + vbroadcasti128 m6, [r0 + 32] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 16], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET +%endmacro +FILTER_HOR_LUMA_W16 4 +FILTER_HOR_LUMA_W16 8 +FILTER_HOR_LUMA_W16 12 +FILTER_HOR_LUMA_W16 16 +FILTER_HOR_LUMA_W16 32 +FILTER_HOR_LUMA_W16 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W32 2 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, %2 + +.loop: +%assign x 0 +%rep %1/16 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 8 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8 + x] + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu 
[r2 + x], xm4 + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m5, [r0 + 24 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24 + x] + vbroadcasti128 m6, [r0 + 32 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 16 + x], xm4 + +%assign x x+32 +%endrep + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET +%endmacro +FILTER_HOR_LUMA_W32 32, 8 +FILTER_HOR_LUMA_W32 32, 16 +FILTER_HOR_LUMA_W32 32, 24 +FILTER_HOR_LUMA_W32 32, 32 +FILTER_HOR_LUMA_W32 32, 64 +FILTER_HOR_LUMA_W32 64, 16 +FILTER_HOR_LUMA_W32 64, 32 +FILTER_HOR_LUMA_W32 64, 48 +FILTER_HOR_LUMA_W32 64, 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_12x16, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, 16 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8] + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad 
m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24] + vbroadcasti128 m6, [r0 + 32] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movq [r2 + 16], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_24x32, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, 32 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8] + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + + 
pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24] + vbroadcasti128 m6, [r0 + 32] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 40] + vbroadcasti128 m6, [r0 + 48] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 32], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_48x64, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, 64 + +.loop: +%assign x 0 +%rep 2 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 8 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8 + x] + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + 
pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + x], xm4 + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m5, [r0 + 24 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24 + x] + vbroadcasti128 m6, [r0 + 32 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 16 + x], xm4 + + vbroadcasti128 m4, [r0 + 32 + x] + vbroadcasti128 m5, [r0 + 40 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 40 + x] + vbroadcasti128 m6, [r0 + 48 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 32 + x], xm4 + +%assign x x+48 +%endrep + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_CHROMA_H 6 +INIT_XMM sse4 +cglobal interp_4tap_horiz_%3_%1x%2, 4, %4, %5 + + add r3, r3 + add r1, r1 + sub r0, 2 + mov r4d, r4m + add r4d, r4d + +%ifdef PIC + lea r%6, [tab_ChromaCoeff] + movh m0, [r%6 + r4 * 4] +%else + movh m0, [tab_ChromaCoeff + r4 * 4] +%endif + + punpcklqdq m0, m0 + mova m2, [tab_Tm16] + +%ifidn %3, ps + mova m1, [INTERP_OFFSET_PS] + cmp r5m, byte 0 + je .skip + sub r0, r1 + movu m3, [r0] + pshufb m3, 
m3, m2 + pmaddwd m3, m0 + + %if %1 == 4 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + %else + phaddd m3, m3 + %endif + + paddd m3, m1 + psrad m3, INTERP_SHIFT_PS + packssdw m3, m3 + + %if %1 == 2 + movd [r2], m3 + %else + movh [r2], m3 + %endif + + add r0, r1 + add r2, r3 + FILTER_W%1_2 %3 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + +.skip: + +%else ;%ifidn %3, ps + pxor m7, m7 + mova m6, [pw_pixel_max] + mova m1, [tab_c_32] +%endif ;%ifidn %3, ps + + FILTER_W%1_2 %3 + +%rep (%2/2) - 1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + FILTER_W%1_2 %3 +%endrep + RET +%endmacro + +FILTER_CHROMA_H 2, 4, pp, 6, 8, 5 +FILTER_CHROMA_H 2, 8, pp, 6, 8, 5 +FILTER_CHROMA_H 4, 2, pp, 6, 8, 5 +FILTER_CHROMA_H 4, 4, pp, 6, 8, 5 +FILTER_CHROMA_H 4, 8, pp, 6, 8, 5 +FILTER_CHROMA_H 4, 16, pp, 6, 8, 5 + +FILTER_CHROMA_H 2, 4, ps, 7, 5, 6 +FILTER_CHROMA_H 2, 8, ps, 7, 5, 6 +FILTER_CHROMA_H 4, 2, ps, 7, 6, 6 +FILTER_CHROMA_H 4, 4, ps, 7, 6, 6 +FILTER_CHROMA_H 4, 8, ps, 7, 6, 6 +FILTER_CHROMA_H 4, 16, ps, 7, 6, 6 + +FILTER_CHROMA_H 2, 16, pp, 6, 8, 5 +FILTER_CHROMA_H 4, 32, pp, 6, 8, 5 +FILTER_CHROMA_H 2, 16, ps, 7, 5, 6 +FILTER_CHROMA_H 4, 32, ps, 7, 6, 6 + + +%macro FILTER_W6_1 1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m4, [r0 + 8] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m4, m4 + paddd m4, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m4, INTERP_SHIFT_PP + packusdw m3, m4 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m4, INTERP_SHIFT_PS + packssdw m3, m4 +%endif + movh [r2], m3 + pextrd [r2 + 8], m3, 2 +%endmacro + +cglobal chroma_filter_pp_6x1_internal + FILTER_W6_1 pp + ret + +cglobal chroma_filter_ps_6x1_internal + FILTER_W6_1 ps + ret + +%macro FILTER_W8_1 1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu 
m5, [r0 + 8] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 12] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2], m3 + movhps [r2 + 8], m3 +%endmacro + +cglobal chroma_filter_pp_8x1_internal + FILTER_W8_1 pp + ret + +cglobal chroma_filter_ps_8x1_internal + FILTER_W8_1 ps + ret + +%macro FILTER_W12_1 1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 8] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 12] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2], m3 + movhps [r2 + 8], m3 + + movu m3, [r0 + 16] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 20] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + packusdw m3, m3 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + packssdw m3, m3 +%endif + movh [r2 + 16], m3 +%endmacro + +cglobal chroma_filter_pp_12x1_internal + FILTER_W12_1 pp + ret + +cglobal chroma_filter_ps_12x1_internal + FILTER_W12_1 ps + ret + +%macro FILTER_W16_1 1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 8] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 12] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, 
INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2], m3 + movhps [r2 + 8], m3 + + movu m3, [r0 + 16] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 20] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 24] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 28] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + 16], m3 + movhps [r2 + 24], m3 +%endmacro + +cglobal chroma_filter_pp_16x1_internal + FILTER_W16_1 pp + ret + +cglobal chroma_filter_ps_16x1_internal + FILTER_W16_1 ps + ret + +%macro FILTER_W24_1 1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 8] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 12] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2], m3 + movhps [r2 + 8], m3 + + movu m3, [r0 + 16] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 20] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 24] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 28] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + 16], m3 + movhps [r2 + 24], m3 + + movu m3, [r0 + 32] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 36] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 
+ + movu m5, [r0 + 40] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 44] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + 32], m3 + movhps [r2 + 40], m3 +%endmacro + +cglobal chroma_filter_pp_24x1_internal + FILTER_W24_1 pp + ret + +cglobal chroma_filter_ps_24x1_internal + FILTER_W24_1 ps + ret + +%macro FILTER_W32_1 1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 8] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 12] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2], m3 + movhps [r2 + 8], m3 + + movu m3, [r0 + 16] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 20] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 24] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 28] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + 16], m3 + movhps [r2 + 24], m3 + + movu m3, [r0 + 32] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 36] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 40] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 44] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + 
packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + 32], m3 + movhps [r2 + 40], m3 + + movu m3, [r0 + 48] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + 52] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + 56] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + 60] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + 48], m3 + movhps [r2 + 56], m3 +%endmacro + +cglobal chroma_filter_pp_32x1_internal + FILTER_W32_1 pp + ret + +cglobal chroma_filter_ps_32x1_internal + FILTER_W32_1 ps + ret + +%macro FILTER_W8o_1 2 + movu m3, [r0 + %2] + pshufb m3, m3, m2 + pmaddwd m3, m0 + movu m4, [r0 + %2 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m1 + + movu m5, [r0 + %2 + 8] + pshufb m5, m5, m2 + pmaddwd m5, m0 + movu m4, [r0 + %2 + 12] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m5, m4 + paddd m5, m1 +%ifidn %1, pp + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + packusdw m3, m5 + CLIPW m3, m6, m7 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + %2], m3 + movhps [r2 + %2 + 8], m3 +%endmacro + +%macro FILTER_W48_1 1 + FILTER_W8o_1 %1, 0 + FILTER_W8o_1 %1, 16 + FILTER_W8o_1 %1, 32 + FILTER_W8o_1 %1, 48 + FILTER_W8o_1 %1, 64 + FILTER_W8o_1 %1, 80 +%endmacro + +cglobal chroma_filter_pp_48x1_internal + FILTER_W48_1 pp + ret + +cglobal chroma_filter_ps_48x1_internal + FILTER_W48_1 ps + ret + +%macro FILTER_W64_1 1 + FILTER_W8o_1 %1, 0 + FILTER_W8o_1 %1, 16 + FILTER_W8o_1 %1, 32 + FILTER_W8o_1 %1, 48 + FILTER_W8o_1 %1, 64 + FILTER_W8o_1 %1, 80 + FILTER_W8o_1 %1, 96 + FILTER_W8o_1 %1, 112 +%endmacro + +cglobal chroma_filter_pp_64x1_internal + 
FILTER_W64_1 pp + ret + +cglobal chroma_filter_ps_64x1_internal + FILTER_W64_1 ps + ret + + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- + +INIT_XMM sse4 +%macro IPFILTER_CHROMA 6 +cglobal interp_4tap_horiz_%3_%1x%2, 4, %5, %6 + + add r3, r3 + add r1, r1 + sub r0, 2 + mov r4d, r4m + add r4d, r4d + +%ifdef PIC + lea r%4, [tab_ChromaCoeff] + movh m0, [r%4 + r4 * 4] +%else + movh m0, [tab_ChromaCoeff + r4 * 4] +%endif + + punpcklqdq m0, m0 + mova m2, [tab_Tm16] + +%ifidn %3, ps + mova m1, [INTERP_OFFSET_PS] + cmp r5m, byte 0 + je .skip + sub r0, r1 + call chroma_filter_%3_%1x1_internal + add r0, r1 + add r2, r3 + call chroma_filter_%3_%1x1_internal + add r0, r1 + add r2, r3 + call chroma_filter_%3_%1x1_internal + add r0, r1 + add r2, r3 +.skip: +%else + mova m1, [tab_c_32] + pxor m6, m6 + mova m7, [pw_pixel_max] +%endif + + call chroma_filter_%3_%1x1_internal +%rep %2 - 1 + add r0, r1 + add r2, r3 + call chroma_filter_%3_%1x1_internal +%endrep +RET +%endmacro +IPFILTER_CHROMA 6, 8, pp, 5, 6, 8 +IPFILTER_CHROMA 8, 2, pp, 5, 6, 8 +IPFILTER_CHROMA 8, 4, pp, 5, 6, 8 +IPFILTER_CHROMA 8, 6, pp, 5, 6, 8 +IPFILTER_CHROMA 8, 8, pp, 5, 6, 8 +IPFILTER_CHROMA 8, 16, pp, 5, 6, 8 +IPFILTER_CHROMA 8, 32, pp, 5, 6, 8 +IPFILTER_CHROMA 12, 16, pp, 5, 6, 8 +IPFILTER_CHROMA 16, 4, pp, 5, 6, 8 +IPFILTER_CHROMA 16, 8, pp, 5, 6, 8 +IPFILTER_CHROMA 16, 12, pp, 5, 6, 8 +IPFILTER_CHROMA 16, 16, pp, 5, 6, 8 +IPFILTER_CHROMA 16, 32, pp, 5, 6, 8 +IPFILTER_CHROMA 24, 32, pp, 5, 6, 8 +IPFILTER_CHROMA 32, 8, pp, 5, 6, 8 +IPFILTER_CHROMA 32, 16, pp, 5, 6, 8 +IPFILTER_CHROMA 32, 24, pp, 5, 6, 8 +IPFILTER_CHROMA 32, 32, pp, 5, 6, 8 + +IPFILTER_CHROMA 6, 8, ps, 6, 7, 6 +IPFILTER_CHROMA 8, 2, ps, 6, 7, 6 +IPFILTER_CHROMA 8, 4, ps, 6, 7, 6 +IPFILTER_CHROMA 8, 6, ps, 6, 7, 6 
+IPFILTER_CHROMA 8, 8, ps, 6, 7, 6 +IPFILTER_CHROMA 8, 16, ps, 6, 7, 6 +IPFILTER_CHROMA 8, 32, ps, 6, 7, 6 +IPFILTER_CHROMA 12, 16, ps, 6, 7, 6 +IPFILTER_CHROMA 16, 4, ps, 6, 7, 6 +IPFILTER_CHROMA 16, 8, ps, 6, 7, 6 +IPFILTER_CHROMA 16, 12, ps, 6, 7, 6 +IPFILTER_CHROMA 16, 16, ps, 6, 7, 6 +IPFILTER_CHROMA 16, 32, ps, 6, 7, 6 +IPFILTER_CHROMA 24, 32, ps, 6, 7, 6 +IPFILTER_CHROMA 32, 8, ps, 6, 7, 6 +IPFILTER_CHROMA 32, 16, ps, 6, 7, 6 +IPFILTER_CHROMA 32, 24, ps, 6, 7, 6 +IPFILTER_CHROMA 32, 32, ps, 6, 7, 6 + +IPFILTER_CHROMA 6, 16, pp, 5, 6, 8 +IPFILTER_CHROMA 8, 12, pp, 5, 6, 8 +IPFILTER_CHROMA 8, 64, pp, 5, 6, 8 +IPFILTER_CHROMA 12, 32, pp, 5, 6, 8 +IPFILTER_CHROMA 16, 24, pp, 5, 6, 8 +IPFILTER_CHROMA 16, 64, pp, 5, 6, 8 +IPFILTER_CHROMA 24, 64, pp, 5, 6, 8 +IPFILTER_CHROMA 32, 48, pp, 5, 6, 8 +IPFILTER_CHROMA 32, 64, pp, 5, 6, 8 +IPFILTER_CHROMA 6, 16, ps, 6, 7, 6 +IPFILTER_CHROMA 8, 12, ps, 6, 7, 6 +IPFILTER_CHROMA 8, 64, ps, 6, 7, 6 +IPFILTER_CHROMA 12, 32, ps, 6, 7, 6 +IPFILTER_CHROMA 16, 24, ps, 6, 7, 6 +IPFILTER_CHROMA 16, 64, ps, 6, 7, 6 +IPFILTER_CHROMA 24, 64, ps, 6, 7, 6 +IPFILTER_CHROMA 32, 48, ps, 6, 7, 6 +IPFILTER_CHROMA 32, 64, ps, 6, 7, 6 + +IPFILTER_CHROMA 48, 64, pp, 5, 6, 8 +IPFILTER_CHROMA 64, 48, pp, 5, 6, 8 +IPFILTER_CHROMA 64, 64, pp, 5, 6, 8 +IPFILTER_CHROMA 64, 32, pp, 5, 6, 8 +IPFILTER_CHROMA 64, 16, pp, 5, 6, 8 +IPFILTER_CHROMA 48, 64, ps, 6, 7, 6 +IPFILTER_CHROMA 64, 48, ps, 6, 7, 6 +IPFILTER_CHROMA 64, 64, ps, 6, 7, 6 +IPFILTER_CHROMA 64, 32, ps, 6, 7, 6 +IPFILTER_CHROMA 64, 16, ps, 6, 7, 6 + + +%macro PROCESS_CHROMA_SP_W4_4R 0 + movq m0, [r0] + movq m1, [r0 + r1] + punpcklwd m0, m1 ;m0=[0 1] + pmaddwd m0, [r6 + 0 *32] ;m0=[0+1] Row1 + + lea r0, [r0 + 2 * r1] + movq m4, [r0] + punpcklwd m1, m4 ;m1=[1 2] + pmaddwd m1, [r6 + 0 *32] ;m1=[1+2] Row2 + + movq m5, [r0 + r1] + punpcklwd m4, m5 ;m4=[2 3] + pmaddwd m2, m4, [r6 + 0 *32] ;m2=[2+3] Row3 + pmaddwd m4, [r6 + 1 * 32] + paddd m0, m4 ;m0=[0+1+2+3] Row1 done + + lea r0, [r0 + 2 * r1] + 
movq m4, [r0] + punpcklwd m5, m4 ;m5=[3 4] + pmaddwd m3, m5, [r6 + 0 *32] ;m3=[3+4] Row4 + pmaddwd m5, [r6 + 1 * 32] + paddd m1, m5 ;m1 = [1+2+3+4] Row2 + + movq m5, [r0 + r1] + punpcklwd m4, m5 ;m4=[4 5] + pmaddwd m4, [r6 + 1 * 32] + paddd m2, m4 ;m2=[2+3+4+5] Row3 + + movq m4, [r0 + 2 * r1] + punpcklwd m5, m4 ;m5=[5 6] + pmaddwd m5, [r6 + 1 * 32] + paddd m3, m5 ;m3=[3+4+5+6] Row4 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_6xN 1 +cglobal interp_4tap_horiz_pp_6x%1, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1/2 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movq [r2], xm3 + pextrd [r2 + 8], xm3, 2 + + vbroadcasti128 m3, [r0 + r1] + vbroadcasti128 m4, [r0 + r1 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movq [r2 + r3], xm3 + pextrd [r2 + r3 + 8], xm3, 2 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + dec r4d + jnz .loop + RET 
+%endmacro +IPFILTER_CHROMA_avx2_6xN 8 +IPFILTER_CHROMA_avx2_6xN 16 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_8x2, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3,q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2], xm3 + + vbroadcasti128 m3, [r0 + r1] + vbroadcasti128 m4, [r0 + r1 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3,q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2 + r3], xm3 + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_8x4, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] 
+%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + +%rep 2 + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3,q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2], xm3 + + vbroadcasti128 m3, [r0 + r1] + vbroadcasti128 m4, [r0 + r1 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3,q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2 + r3], xm3 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] +%endrep + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_8xN 1 +cglobal interp_4tap_horiz_pp_8x%1, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1/2 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, 
xm5, xm7 + movu [r2], xm3 + + vbroadcasti128 m3, [r0 + r1] + vbroadcasti128 m4, [r0 + r1 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2 + r3], xm3 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + dec r4d + jnz .loop + RET +%endmacro +IPFILTER_CHROMA_avx2_8xN 6 +IPFILTER_CHROMA_avx2_8xN 8 +IPFILTER_CHROMA_avx2_8xN 12 +IPFILTER_CHROMA_avx2_8xN 16 +IPFILTER_CHROMA_avx2_8xN 32 +IPFILTER_CHROMA_avx2_8xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_16xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_16x%1, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m8, [r0 + 24] + + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m4, m4 + vpermq m4, m4, q2020 + pshufb xm4, xm6 ; m3 = WORD[7 6 5 4 3 2 
1 0] + vinserti128 m3, m3, xm4, 1 + CLIPW m3, m5, m7 + movu [r2], m3 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_16xN 4 +IPFILTER_CHROMA_avx2_16xN 8 +IPFILTER_CHROMA_avx2_16xN 12 +IPFILTER_CHROMA_avx2_16xN 16 +IPFILTER_CHROMA_avx2_16xN 24 +IPFILTER_CHROMA_avx2_16xN 32 +IPFILTER_CHROMA_avx2_16xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_32xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_32x%1, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r6d, %1 +.loop: +%assign x 0 +%rep 2 + vbroadcasti128 m3, [r0 + x] + vbroadcasti128 m4, [r0 + 8 + x] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m8, [r0 + 24 + x] + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m4, m4 + vpermq m4, m4, q2020 + pshufb xm4, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + vinserti128 m3, m3, xm4, 1 + CLIPW m3, m5, m7 + movu [r2 + x], m3 + %assign x x+32 + %endrep + + add r0, r1 + add r2, r3 + dec r6d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_32xN 8 +IPFILTER_CHROMA_avx2_32xN 16 
+IPFILTER_CHROMA_avx2_32xN 24 +IPFILTER_CHROMA_avx2_32xN 32 +IPFILTER_CHROMA_avx2_32xN 48 +IPFILTER_CHROMA_avx2_32xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_12xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_12x%1, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2], xm3 + + vbroadcasti128 m3, [r0 + 16] + vbroadcasti128 m4, [r0 + 24] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movq [r2 + 16], xm3 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_12xN 16 +IPFILTER_CHROMA_avx2_12xN 32 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx 
+;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_24xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_24x%1, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m8, [r0 + 24] + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 + + packusdw m3, m4 + vpermq m3, m3, q3120 + pshufb m3, m6 + CLIPW m3, m5, m7 + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 32] + vbroadcasti128 m4, [r0 + 40] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 + CLIPW xm3, xm5, xm7 + movu [r2 + 32], xm3 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_24xN 32 +IPFILTER_CHROMA_avx2_24xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_64xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_64x%1, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else 
+ vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r6d, %1 +.loop: +%assign x 0 +%rep 4 + vbroadcasti128 m3, [r0 + x] + vbroadcasti128 m4, [r0 + 8 + x] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m8, [r0 + 24 + x] + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 + + packusdw m3, m4 + vpermq m3, m3, q3120 + pshufb m3, m6 + CLIPW m3, m5, m7 + movu [r2 + x], m3 + %assign x x+32 + %endrep + + add r0, r1 + add r2, r3 + dec r6d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_64xN 16 +IPFILTER_CHROMA_avx2_64xN 32 +IPFILTER_CHROMA_avx2_64xN 48 +IPFILTER_CHROMA_avx2_64xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_48x64, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, 64 +.loop: +%assign x 0 +%rep 3 + vbroadcasti128 m3, [r0 + x] + vbroadcasti128 m4, [r0 + 8 + x] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m8, [r0 + 24 + x] + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, 
m8 + paddd m4, m2 + psrad m4, 6 + + packusdw m3, m4 + vpermq m3, m3, q3120 + pshufb m3, m6 + CLIPW m3, m5, m7 + movu [r2 + x], m3 +%assign x x+32 +%endrep + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endif + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_%3_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SS 4 +INIT_XMM sse2 +cglobal interp_4tap_vert_%3_%1x%2, 5, 7, %4 ,0-gprsize + + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r6, [r5 + r4] +%else + lea r6, [tab_ChromaCoeffV + r4] +%endif + + mov dword [rsp], %2/4 + +%ifnidn %3, ss + %ifnidn %3, ps + mova m7, [pw_pixel_max] + %ifidn %3, pp + mova m6, [INTERP_OFFSET_PP] + %else + mova m6, [INTERP_OFFSET_SP] + %endif + %else + mova m6, [INTERP_OFFSET_PS] + %endif +%endif + +.loopH: + mov r4d, (%1/4) +.loopW: + PROCESS_CHROMA_SP_W4_4R + +%ifidn %3, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %3, ps + paddd m0, m6 + paddd m1, m6 + paddd m2, m6 + paddd m3, m6 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m6 + paddd m1, m6 + paddd m2, m6 + paddd m3, m6 + %ifidn %3, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + %else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + %endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, m7 +%endif + + movh [r2], m0 + movhps [r2 + r3], m0 + lea r5, [r2 + 2 * r3] + movh [r5], m2 + movhps [r5 + r3], m2 + + lea r5, 
[4 * r1 - 2 * 4] + sub r0, r5 + add r2, 2 * 4 + + dec r4d + jnz .loopW + + lea r0, [r0 + 4 * r1 - 2 * %1] + lea r2, [r2 + 4 * r3 - 2 * %1] + + dec dword [rsp] + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SS 4, 4, ss, 6 + FILTER_VER_CHROMA_SS 4, 8, ss, 6 + FILTER_VER_CHROMA_SS 16, 16, ss, 6 + FILTER_VER_CHROMA_SS 16, 8, ss, 6 + FILTER_VER_CHROMA_SS 16, 12, ss, 6 + FILTER_VER_CHROMA_SS 12, 16, ss, 6 + FILTER_VER_CHROMA_SS 16, 4, ss, 6 + FILTER_VER_CHROMA_SS 4, 16, ss, 6 + FILTER_VER_CHROMA_SS 32, 32, ss, 6 + FILTER_VER_CHROMA_SS 32, 16, ss, 6 + FILTER_VER_CHROMA_SS 16, 32, ss, 6 + FILTER_VER_CHROMA_SS 32, 24, ss, 6 + FILTER_VER_CHROMA_SS 24, 32, ss, 6 + FILTER_VER_CHROMA_SS 32, 8, ss, 6 + + FILTER_VER_CHROMA_SS 4, 4, ps, 7 + FILTER_VER_CHROMA_SS 4, 8, ps, 7 + FILTER_VER_CHROMA_SS 16, 16, ps, 7 + FILTER_VER_CHROMA_SS 16, 8, ps, 7 + FILTER_VER_CHROMA_SS 16, 12, ps, 7 + FILTER_VER_CHROMA_SS 12, 16, ps, 7 + FILTER_VER_CHROMA_SS 16, 4, ps, 7 + FILTER_VER_CHROMA_SS 4, 16, ps, 7 + FILTER_VER_CHROMA_SS 32, 32, ps, 7 + FILTER_VER_CHROMA_SS 32, 16, ps, 7 + FILTER_VER_CHROMA_SS 16, 32, ps, 7 + FILTER_VER_CHROMA_SS 32, 24, ps, 7 + FILTER_VER_CHROMA_SS 24, 32, ps, 7 + FILTER_VER_CHROMA_SS 32, 8, ps, 7 + + FILTER_VER_CHROMA_SS 4, 4, sp, 8 + FILTER_VER_CHROMA_SS 4, 8, sp, 8 + FILTER_VER_CHROMA_SS 16, 16, sp, 8 + FILTER_VER_CHROMA_SS 16, 8, sp, 8 + FILTER_VER_CHROMA_SS 16, 12, sp, 8 + FILTER_VER_CHROMA_SS 12, 16, sp, 8 + FILTER_VER_CHROMA_SS 16, 4, sp, 8 + FILTER_VER_CHROMA_SS 4, 16, sp, 8 + FILTER_VER_CHROMA_SS 32, 32, sp, 8 + FILTER_VER_CHROMA_SS 32, 16, sp, 8 + FILTER_VER_CHROMA_SS 16, 32, sp, 8 + FILTER_VER_CHROMA_SS 32, 24, sp, 8 + FILTER_VER_CHROMA_SS 24, 32, sp, 8 + FILTER_VER_CHROMA_SS 32, 8, sp, 8 + + FILTER_VER_CHROMA_SS 4, 4, pp, 8 + FILTER_VER_CHROMA_SS 4, 8, pp, 8 + FILTER_VER_CHROMA_SS 16, 16, pp, 8 + FILTER_VER_CHROMA_SS 16, 8, pp, 8 + FILTER_VER_CHROMA_SS 16, 12, pp, 8 + FILTER_VER_CHROMA_SS 12, 16, pp, 8 + FILTER_VER_CHROMA_SS 16, 4, pp, 8 + 
FILTER_VER_CHROMA_SS 4, 16, pp, 8 + FILTER_VER_CHROMA_SS 32, 32, pp, 8 + FILTER_VER_CHROMA_SS 32, 16, pp, 8 + FILTER_VER_CHROMA_SS 16, 32, pp, 8 + FILTER_VER_CHROMA_SS 32, 24, pp, 8 + FILTER_VER_CHROMA_SS 24, 32, pp, 8 + FILTER_VER_CHROMA_SS 32, 8, pp, 8 + + + FILTER_VER_CHROMA_SS 16, 24, ss, 6 + FILTER_VER_CHROMA_SS 12, 32, ss, 6 + FILTER_VER_CHROMA_SS 4, 32, ss, 6 + FILTER_VER_CHROMA_SS 32, 64, ss, 6 + FILTER_VER_CHROMA_SS 16, 64, ss, 6 + FILTER_VER_CHROMA_SS 32, 48, ss, 6 + FILTER_VER_CHROMA_SS 24, 64, ss, 6 + + FILTER_VER_CHROMA_SS 16, 24, ps, 7 + FILTER_VER_CHROMA_SS 12, 32, ps, 7 + FILTER_VER_CHROMA_SS 4, 32, ps, 7 + FILTER_VER_CHROMA_SS 32, 64, ps, 7 + FILTER_VER_CHROMA_SS 16, 64, ps, 7 + FILTER_VER_CHROMA_SS 32, 48, ps, 7 + FILTER_VER_CHROMA_SS 24, 64, ps, 7 + + FILTER_VER_CHROMA_SS 16, 24, sp, 8 + FILTER_VER_CHROMA_SS 12, 32, sp, 8 + FILTER_VER_CHROMA_SS 4, 32, sp, 8 + FILTER_VER_CHROMA_SS 32, 64, sp, 8 + FILTER_VER_CHROMA_SS 16, 64, sp, 8 + FILTER_VER_CHROMA_SS 32, 48, sp, 8 + FILTER_VER_CHROMA_SS 24, 64, sp, 8 + + FILTER_VER_CHROMA_SS 16, 24, pp, 8 + FILTER_VER_CHROMA_SS 12, 32, pp, 8 + FILTER_VER_CHROMA_SS 4, 32, pp, 8 + FILTER_VER_CHROMA_SS 32, 64, pp, 8 + FILTER_VER_CHROMA_SS 16, 64, pp, 8 + FILTER_VER_CHROMA_SS 32, 48, pp, 8 + FILTER_VER_CHROMA_SS 24, 64, pp, 8 + + + FILTER_VER_CHROMA_SS 48, 64, ss, 6 + FILTER_VER_CHROMA_SS 64, 48, ss, 6 + FILTER_VER_CHROMA_SS 64, 64, ss, 6 + FILTER_VER_CHROMA_SS 64, 32, ss, 6 + FILTER_VER_CHROMA_SS 64, 16, ss, 6 + + FILTER_VER_CHROMA_SS 48, 64, ps, 7 + FILTER_VER_CHROMA_SS 64, 48, ps, 7 + FILTER_VER_CHROMA_SS 64, 64, ps, 7 + FILTER_VER_CHROMA_SS 64, 32, ps, 7 + FILTER_VER_CHROMA_SS 64, 16, ps, 7 + + FILTER_VER_CHROMA_SS 48, 64, sp, 8 + FILTER_VER_CHROMA_SS 64, 48, sp, 8 + FILTER_VER_CHROMA_SS 64, 64, sp, 8 + FILTER_VER_CHROMA_SS 64, 32, sp, 8 + FILTER_VER_CHROMA_SS 64, 16, sp, 8 + + FILTER_VER_CHROMA_SS 48, 64, pp, 8 + FILTER_VER_CHROMA_SS 64, 48, pp, 8 + FILTER_VER_CHROMA_SS 64, 64, pp, 8 + FILTER_VER_CHROMA_SS 64, 
32, pp, 8 + FILTER_VER_CHROMA_SS 64, 16, pp, 8 + + +%macro PROCESS_CHROMA_SP_W2_4R 1 + movd m0, [r0] + movd m1, [r0 + r1] + punpcklwd m0, m1 ;m0=[0 1] + + lea r0, [r0 + 2 * r1] + movd m2, [r0] + punpcklwd m1, m2 ;m1=[1 2] + punpcklqdq m0, m1 ;m0=[0 1 1 2] + pmaddwd m0, [%1 + 0 *32] ;m0=[0+1 1+2] Row 1-2 + + movd m1, [r0 + r1] + punpcklwd m2, m1 ;m2=[2 3] + + lea r0, [r0 + 2 * r1] + movd m3, [r0] + punpcklwd m1, m3 ;m2=[3 4] + punpcklqdq m2, m1 ;m2=[2 3 3 4] + + pmaddwd m4, m2, [%1 + 1 * 32] ;m4=[2+3 3+4] Row 1-2 + pmaddwd m2, [%1 + 0 * 32] ;m2=[2+3 3+4] Row 3-4 + paddd m0, m4 ;m0=[0+1+2+3 1+2+3+4] Row 1-2 + + movd m1, [r0 + r1] + punpcklwd m3, m1 ;m3=[4 5] + + movd m4, [r0 + 2 * r1] + punpcklwd m1, m4 ;m1=[5 6] + punpcklqdq m3, m1 ;m2=[4 5 5 6] + pmaddwd m3, [%1 + 1 * 32] ;m3=[4+5 5+6] Row 3-4 + paddd m2, m3 ;m2=[2+3+4+5 3+4+5+6] Row 3-4 +%endmacro + +;--------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vertical_%2_2x%1(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W2 3 +INIT_XMM sse4 +cglobal interp_4tap_vert_%2_2x%1, 5, 6, %3 + + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mov r4d, (%1/4) +%ifnidn %2, ss + %ifnidn %2, ps + pxor m7, m7 + mova m6, [pw_pixel_max] + %ifidn %2, pp + mova m5, [INTERP_OFFSET_PP] + %else + mova m5, [INTERP_OFFSET_SP] + %endif + %else + mova m5, [INTERP_OFFSET_PS] + %endif +%endif + +.loopH: + PROCESS_CHROMA_SP_W2_4R r5 +%ifidn %2, ss + psrad m0, 6 + psrad m2, 6 + packssdw m0, m2 +%elifidn %2, ps + paddd m0, m5 + paddd m2, m5 + psrad m0, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + packssdw m0, m2 +%else + paddd m0, m5 + paddd m2, m5 + %ifidn %2, pp + 
psrad m0, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + %else + psrad m0, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + %endif + packusdw m0, m2 + CLIPW m0, m7, m6 +%endif + + movd [r2], m0 + pextrd [r2 + r3], m0, 1 + lea r2, [r2 + 2 * r3] + pextrd [r2], m0, 2 + pextrd [r2 + r3], m0, 3 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loopH + RET +%endmacro + +FILTER_VER_CHROMA_W2 4, ss, 5 +FILTER_VER_CHROMA_W2 8, ss, 5 + +FILTER_VER_CHROMA_W2 4, pp, 8 +FILTER_VER_CHROMA_W2 8, pp, 8 + +FILTER_VER_CHROMA_W2 4, ps, 6 +FILTER_VER_CHROMA_W2 8, ps, 6 + +FILTER_VER_CHROMA_W2 4, sp, 8 +FILTER_VER_CHROMA_W2 8, sp, 8 + +FILTER_VER_CHROMA_W2 16, ss, 5 +FILTER_VER_CHROMA_W2 16, pp, 8 +FILTER_VER_CHROMA_W2 16, ps, 6 +FILTER_VER_CHROMA_W2 16, sp, 8 + + +;--------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_%1_4x2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W4 3 +INIT_XMM sse4 +cglobal interp_4tap_vert_%2_4x%1, 5, 6, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + +%ifnidn %2, 2 + mov r4d, %1/2 +%endif + +%ifnidn %2, ss + %ifnidn %2, ps + pxor m6, m6 + mova m5, [pw_pixel_max] + %ifidn %2, pp + mova m4, [INTERP_OFFSET_PP] + %else + mova m4, [INTERP_OFFSET_SP] + %endif + %else + mova m4, [INTERP_OFFSET_PS] + %endif +%endif + +%ifnidn %2, 2 +.loop: +%endif + + movh m0, [r0] + movh m1, [r0 + r1] + punpcklwd m0, m1 ;m0=[0 1] + pmaddwd m0, [r5 + 0 *32] ;m0=[0+1] Row1 + + lea r0, [r0 + 2 * r1] + movh m2, [r0] + punpcklwd m1, m2 ;m1=[1 2] + pmaddwd m1, [r5 + 0 *32] ;m1=[1+2] Row2 + + movh m3, [r0 + r1] + punpcklwd m2, m3 ;m4=[2 3] + pmaddwd m2, [r5 + 1 * 32] + paddd m0, m2 ;m0=[0+1+2+3] Row1 done + + movh m2, 
[r0 + 2 * r1] + punpcklwd m3, m2 ;m5=[3 4] + pmaddwd m3, [r5 + 1 * 32] + paddd m1, m3 ;m1=[1+2+3+4] Row2 done + +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + packssdw m0, m1 +%elifidn %2, ps + paddd m0, m4 + paddd m1, m4 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + packssdw m0, m1 +%else + paddd m0, m4 + paddd m1, m4 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + %else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + %endif + packusdw m0, m1 + CLIPW m0, m6, m5 +%endif + + movh [r2], m0 + movhps [r2 + r3], m0 + +%ifnidn %2, 2 + lea r2, [r2 + r3 * 2] + dec r4d + jnz .loop +%endif + RET +%endmacro + +FILTER_VER_CHROMA_W4 2, ss, 4 +FILTER_VER_CHROMA_W4 2, pp, 7 +FILTER_VER_CHROMA_W4 2, ps, 5 +FILTER_VER_CHROMA_W4 2, sp, 7 + +FILTER_VER_CHROMA_W4 4, ss, 4 +FILTER_VER_CHROMA_W4 4, pp, 7 +FILTER_VER_CHROMA_W4 4, ps, 5 +FILTER_VER_CHROMA_W4 4, sp, 7 + +;------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vertical_%1_6x8(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W6 3 +INIT_XMM sse4 +cglobal interp_4tap_vert_%2_6x%1, 5, 7, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r6, [r5 + r4] +%else + lea r6, [tab_ChromaCoeffV + r4] +%endif + + mov r4d, %1/4 + +%ifnidn %2, ss + %ifnidn %2, ps + mova m7, [pw_pixel_max] + %ifidn %2, pp + mova m6, [INTERP_OFFSET_PP] + %else + mova m6, [INTERP_OFFSET_SP] + %endif + %else + mova m6, [INTERP_OFFSET_PS] + %endif +%endif + +.loopH: + PROCESS_CHROMA_SP_W4_4R + +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %2, ps + paddd m0, m6 + paddd m1, m6 + paddd m2, m6 + paddd m3, m6 + psrad m0, INTERP_SHIFT_PS + 
psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m6 + paddd m1, m6 + paddd m2, m6 + paddd m3, m6 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + %else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + %endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, m7 +%endif + + movh [r2], m0 + movhps [r2 + r3], m0 + lea r5, [r2 + 2 * r3] + movh [r5], m2 + movhps [r5 + r3], m2 + + lea r5, [4 * r1 - 2 * 4] + sub r0, r5 + add r2, 2 * 4 + + PROCESS_CHROMA_SP_W2_4R r6 + +%ifidn %2, ss + psrad m0, 6 + psrad m2, 6 + packssdw m0, m2 +%elifidn %2, ps + paddd m0, m6 + paddd m2, m6 + psrad m0, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + packssdw m0, m2 +%else + paddd m0, m6 + paddd m2, m6 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + %else + psrad m0, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + %endif + packusdw m0, m2 + CLIPW m0, m5, m7 +%endif + + movd [r2], m0 + pextrd [r2 + r3], m0, 1 + lea r2, [r2 + 2 * r3] + pextrd [r2], m0, 2 + pextrd [r2 + r3], m0, 3 + + sub r0, 2 * 4 + lea r2, [r2 + 2 * r3 - 2 * 4] + + dec r4d + jnz .loopH + RET +%endmacro + +FILTER_VER_CHROMA_W6 8, ss, 6 +FILTER_VER_CHROMA_W6 8, ps, 7 +FILTER_VER_CHROMA_W6 8, sp, 8 +FILTER_VER_CHROMA_W6 8, pp, 8 + +FILTER_VER_CHROMA_W6 16, ss, 6 +FILTER_VER_CHROMA_W6 16, ps, 7 +FILTER_VER_CHROMA_W6 16, sp, 8 +FILTER_VER_CHROMA_W6 16, pp, 8 + +%macro PROCESS_CHROMA_SP_W8_2R 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5 + 0 * 32] ;m0 = [0l+1l] Row1l + punpckhwd m1, m3 + pmaddwd m1, [r5 + 0 * 32] ;m1 = [0h+1h] Row1h + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5 + 0 * 32] ;m2 = [1l+2l] Row2l + punpckhwd m3, m4 + pmaddwd m3, [r5 + 0 * 32] ;m3 = [1h+2h] Row2h + + lea r0, [r0 + 2 * r1] + 
movu m5, [r0 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * 32] ;m6 = [2l+3l] Row1l + paddd m0, m6 ;m0 = [0l+1l+2l+3l] Row1l sum + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * 32] ;m6 = [2h+3h] Row1h + paddd m1, m4 ;m1 = [0h+1h+2h+3h] Row1h sum + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * 32] ;m6 = [3l+4l] Row2l + paddd m2, m6 ;m2 = [1l+2l+3l+4l] Row2l sum + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * 32] ;m1 = [3h+4h] Row2h + paddd m3, m5 ;m3 = [1h+2h+3h+4h] Row2h sum +%endmacro + +;---------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_%3_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;---------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W8 4 +INIT_XMM sse2 +cglobal interp_4tap_vert_%3_%1x%2, 5, 6, %4 + + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mov r4d, %2/2 + +%ifidn %3, pp + mova m7, [INTERP_OFFSET_PP] +%elifidn %3, sp + mova m7, [INTERP_OFFSET_SP] +%elifidn %3, ps + mova m7, [INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_SP_W8_2R + +%ifidn %3, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %3, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + %ifidn %3, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + %else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + %endif + packssdw 
m0, m1 + packssdw m2, m3 + pxor m5, m5 + mova m6, [pw_pixel_max] + CLIPW2 m0, m2, m5, m6 +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loopH + RET +%endmacro + +FILTER_VER_CHROMA_W8 8, 2, ss, 7 +FILTER_VER_CHROMA_W8 8, 4, ss, 7 +FILTER_VER_CHROMA_W8 8, 6, ss, 7 +FILTER_VER_CHROMA_W8 8, 8, ss, 7 +FILTER_VER_CHROMA_W8 8, 16, ss, 7 +FILTER_VER_CHROMA_W8 8, 32, ss, 7 + +FILTER_VER_CHROMA_W8 8, 2, sp, 8 +FILTER_VER_CHROMA_W8 8, 4, sp, 8 +FILTER_VER_CHROMA_W8 8, 6, sp, 8 +FILTER_VER_CHROMA_W8 8, 8, sp, 8 +FILTER_VER_CHROMA_W8 8, 16, sp, 8 +FILTER_VER_CHROMA_W8 8, 32, sp, 8 + +FILTER_VER_CHROMA_W8 8, 2, ps, 8 +FILTER_VER_CHROMA_W8 8, 4, ps, 8 +FILTER_VER_CHROMA_W8 8, 6, ps, 8 +FILTER_VER_CHROMA_W8 8, 8, ps, 8 +FILTER_VER_CHROMA_W8 8, 16, ps, 8 +FILTER_VER_CHROMA_W8 8, 32, ps, 8 + +FILTER_VER_CHROMA_W8 8, 2, pp, 8 +FILTER_VER_CHROMA_W8 8, 4, pp, 8 +FILTER_VER_CHROMA_W8 8, 6, pp, 8 +FILTER_VER_CHROMA_W8 8, 8, pp, 8 +FILTER_VER_CHROMA_W8 8, 16, pp, 8 +FILTER_VER_CHROMA_W8 8, 32, pp, 8 + +FILTER_VER_CHROMA_W8 8, 12, ss, 7 +FILTER_VER_CHROMA_W8 8, 64, ss, 7 +FILTER_VER_CHROMA_W8 8, 12, sp, 8 +FILTER_VER_CHROMA_W8 8, 64, sp, 8 +FILTER_VER_CHROMA_W8 8, 12, ps, 8 +FILTER_VER_CHROMA_W8 8, 64, ps, 8 +FILTER_VER_CHROMA_W8 8, 12, pp, 8 +FILTER_VER_CHROMA_W8 8, 64, pp, 8 + +%macro PROCESS_CHROMA_VERT_W16_2R 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5 + 0 * 32] + punpckhwd m1, m3 + pmaddwd m1, [r5 + 0 * 32] + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5 + 0 * 32] + punpckhwd m3, m4 + pmaddwd m3, [r5 + 0 * 32] + + lea r0, [r0 + 2 * r1] + movu m5, [r0 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * 32] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * 32] + paddd m1, m4 + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * 32] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * 32] + paddd m3, m5 +%endmacro + 
+;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_AVX2_6xN 2 +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_%2_6x%1, 4, 7, 10 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + sub r0, r1 + mov r6d, %1/4 + +%ifidn %2,pp + vbroadcasti128 m8, [INTERP_OFFSET_PP] +%elifidn %2, sp + vbroadcasti128 m8, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m8, [INTERP_OFFSET_PS] +%endif + +.loopH: + movu xm0, [r0] + movu xm1, [r0 + r1] + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + + movu xm2, [r0 + r1 * 2] + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + + lea r4, [r1 * 3] + movu xm3, [r0 + r4] + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 + + lea r0, [r0 + r1 * 4] + movu xm4, [r0] + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + paddd m1, m5 + + movu xm5, [r0 + r1] + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + pmaddwd m4, [r5] + paddd m2, m6 + + movu xm6, [r0 + r1 * 2] + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m7 + lea r4, [r3 * 3] +%ifidn %2,ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 +%else + paddd m0, m8 + paddd m1, m8 + paddd m2, m8 + paddd m3, m8 +%ifidn %2,pp + psrad m0, 
INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%elifidn %2, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS +%endif +%endif + + packssdw m0, m1 + packssdw m2, m3 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + pxor m5, m5 + mova m9, [pw_pixel_max] +%ifidn %2,pp + CLIPW m0, m5, m9 + CLIPW m2, m5, m9 +%elifidn %2, sp + CLIPW m0, m5, m9 + CLIPW m2, m5, m9 +%endif + + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + movq [r2], xm0 + pextrd [r2 + 8], xm0, 2 + movq [r2 + r3], xm1 + pextrd [r2 + r3 + 8], xm1, 2 + movq [r2 + r3 * 2], xm2 + pextrd [r2 + r3 * 2 + 8], xm2, 2 + movq [r2 + r4], xm3 + pextrd [r2 + r4 + 8], xm3, 2 + + lea r2, [r2 + r3 * 4] + dec r6d + jnz .loopH + RET +%endif +%endmacro +FILTER_VER_CHROMA_AVX2_6xN 8, pp +FILTER_VER_CHROMA_AVX2_6xN 8, ps +FILTER_VER_CHROMA_AVX2_6xN 8, ss +FILTER_VER_CHROMA_AVX2_6xN 8, sp +FILTER_VER_CHROMA_AVX2_6xN 16, pp +FILTER_VER_CHROMA_AVX2_6xN 16, ps +FILTER_VER_CHROMA_AVX2_6xN 16, ss +FILTER_VER_CHROMA_AVX2_6xN 16, sp + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_16xN_avx2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%2_16x%1, 5, 6, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mov r4d, %1/2 + +%ifidn %2, pp + vbroadcasti128 m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + vbroadcasti128 m7, 
[INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_VERT_W16_2R +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + lea r2, [r2 + 2 * r3] + dec r4d + jnz .loopH + RET +%endmacro + FILTER_VER_CHROMA_W16_16xN_avx2 4, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 8, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 12, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 24, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 16, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 32, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 64, pp, 8 + + FILTER_VER_CHROMA_W16_16xN_avx2 4, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 8, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 12, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 24, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 16, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 32, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 64, ps, 8 + + FILTER_VER_CHROMA_W16_16xN_avx2 4, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 8, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 12, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 24, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 16, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 32, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 64, ss, 7 + + FILTER_VER_CHROMA_W16_16xN_avx2 4, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 8, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 12, sp, 8 + 
FILTER_VER_CHROMA_W16_16xN_avx2 24, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 16, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 32, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 64, sp, 8 + +%macro PROCESS_CHROMA_VERT_W32_2R 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5 + 0 * mmsize] + punpckhwd m1, m3 + pmaddwd m1, [r5 + 0 * mmsize] + + movu m9, [r0 + mmsize] + movu m11, [r0 + r1 + mmsize] + punpcklwd m8, m9, m11 + pmaddwd m8, [r5 + 0 * mmsize] + punpckhwd m9, m11 + pmaddwd m9, [r5 + 0 * mmsize] + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5 + 0 * mmsize] + punpckhwd m3, m4 + pmaddwd m3, [r5 + 0 * mmsize] + + movu m12, [r0 + 2 * r1 + mmsize] + punpcklwd m10, m11, m12 + pmaddwd m10, [r5 + 0 * mmsize] + punpckhwd m11, m12 + pmaddwd m11, [r5 + 0 * mmsize] + + lea r6, [r0 + 2 * r1] + movu m5, [r6 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m1, m4 + + movu m13, [r6 + r1 + mmsize] + punpcklwd m14, m12, m13 + pmaddwd m14, [r5 + 1 * mmsize] + paddd m8, m14 + punpckhwd m12, m13 + pmaddwd m12, [r5 + 1 * mmsize] + paddd m9, m12 + + movu m4, [r6 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + + movu m12, [r6 + 2 * r1 + mmsize] + punpcklwd m14, m13, m12 + pmaddwd m14, [r5 + 1 * mmsize] + paddd m10, m14 + punpckhwd m13, m12 + pmaddwd m13, [r5 + 1 * mmsize] + paddd m11, m13 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_32xN_avx2 3 +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_%2_32x%1, 5, 7, %3 + add r1d, r1d 
+ add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, %1/2 + +%ifidn %2, pp + vbroadcasti128 m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_VERT_W32_2R +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m8, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] + CLIPW2 m8, m10, m5, [pw_pixel_max] +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + mmsize], m8 + movu [r2 + r3 + mmsize], m10 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loopH + RET +%endif 
+%endmacro + FILTER_VER_CHROMA_W16_32xN_avx2 8, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 16, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 24, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 32, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 48, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 64, pp, 15 + + FILTER_VER_CHROMA_W16_32xN_avx2 8, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 16, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 24, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 32, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 48, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 64, ps, 15 + + FILTER_VER_CHROMA_W16_32xN_avx2 8, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 16, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 24, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 32, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 48, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 64, ss, 15 + + FILTER_VER_CHROMA_W16_32xN_avx2 8, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 16, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 24, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 32, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 48, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 64, sp, 15 + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_64xN_avx2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%2_64x%1, 5, 7, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, %1/2 + +%ifidn %2, pp + vbroadcasti128 m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + +.loopH: +%assign x 0 +%rep 4 + movu m1, [r0 + x] + movu m3, [r0 + r1 + x] + movu m5, [r5 + 0 * 
mmsize] + punpcklwd m0, m1, m3 + pmaddwd m0, m5 + punpckhwd m1, m3 + pmaddwd m1, m5 + + movu m4, [r0 + 2 * r1 + x] + punpcklwd m2, m3, m4 + pmaddwd m2, m5 + punpckhwd m3, m4 + pmaddwd m3, m5 + + lea r6, [r0 + 2 * r1] + movu m5, [r6 + r1 + x] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m1, m4 + + movu m4, [r6 + 2 * r1 + x] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 +%ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] +%endif + + movu [r2 + x], m0 + movu [r2 + r3 + x], m2 +%assign x x+mmsize +%endrep + + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loopH + RET +%endmacro + FILTER_VER_CHROMA_W16_64xN_avx2 16, ss, 7 + FILTER_VER_CHROMA_W16_64xN_avx2 32, ss, 7 + FILTER_VER_CHROMA_W16_64xN_avx2 48, ss, 7 + FILTER_VER_CHROMA_W16_64xN_avx2 64, ss, 7 + FILTER_VER_CHROMA_W16_64xN_avx2 16, sp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 32, sp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 48, sp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 64, sp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 16, ps, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 32, ps, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 48, ps, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 64, ps, 8 + 
FILTER_VER_CHROMA_W16_64xN_avx2 16, pp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 32, pp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 48, pp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 64, pp, 8 + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_12xN_avx2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%2_12x%1, 5, 8, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, %1/2 + +%ifidn %2, pp + vbroadcasti128 m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_VERT_W16_2R +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] +%endif + + movu [r2], xm0 + movu [r2 + r3], xm2 + vextracti128 xm0, m0, 1 + vextracti128 xm2, m2, 1 + movq [r2 + 16], xm0 + movq [r2 + r3 + 16], xm2 + lea r2, [r2 + 2 * r3] + dec r4d + jnz .loopH + RET +%endmacro + 
FILTER_VER_CHROMA_W16_12xN_avx2 16, ss, 7 + FILTER_VER_CHROMA_W16_12xN_avx2 16, sp, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 16, ps, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 16, pp, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 32, ss, 7 + FILTER_VER_CHROMA_W16_12xN_avx2 32, sp, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 32, ps, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 32, pp, 8 + +%macro PROCESS_CHROMA_VERT_W24_2R 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5 + 0 * mmsize] + punpckhwd m1, m3 + pmaddwd m1, [r5 + 0 * mmsize] + + movu xm9, [r0 + mmsize] + movu xm11, [r0 + r1 + mmsize] + punpcklwd xm8, xm9, xm11 + pmaddwd xm8, [r5 + 0 * mmsize] + punpckhwd xm9, xm11 + pmaddwd xm9, [r5 + 0 * mmsize] + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5 + 0 * mmsize] + punpckhwd m3, m4 + pmaddwd m3, [r5 + 0 * mmsize] + + movu xm12, [r0 + 2 * r1 + mmsize] + punpcklwd xm10, xm11, xm12 + pmaddwd xm10, [r5 + 0 * mmsize] + punpckhwd xm11, xm12 + pmaddwd xm11, [r5 + 0 * mmsize] + + lea r6, [r0 + 2 * r1] + movu m5, [r6 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m1, m4 + + movu xm13, [r6 + r1 + mmsize] + punpcklwd xm14, xm12, xm13 + pmaddwd xm14, [r5 + 1 * mmsize] + paddd xm8, xm14 + punpckhwd xm12, xm13 + pmaddwd xm12, [r5 + 1 * mmsize] + paddd xm9, xm12 + + movu m4, [r6 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + + movu xm12, [r6 + 2 * r1 + mmsize] + punpcklwd xm14, xm13, xm12 + pmaddwd xm14, [r5 + 1 * mmsize] + paddd xm10, xm14 + punpckhwd xm13, xm12 + pmaddwd xm13, [r5 + 1 * mmsize] + paddd xm11, xm13 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) 
+;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_24xN_avx2 3 +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_%2_24x%1, 5, 7, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, %1/2 + +%ifidn %2, pp + vbroadcasti128 m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_VERT_W24_2R +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m8, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 + pxor m5, m5 + CLIPW2 m0, 
m2, m5, [pw_pixel_max] + CLIPW2 m8, m10, m5, [pw_pixel_max] +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + mmsize], xm8 + movu [r2 + r3 + mmsize], xm10 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loopH + RET +%endif +%endmacro + FILTER_VER_CHROMA_W16_24xN_avx2 32, ss, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 32, sp, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 32, ps, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 32, pp, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 64, ss, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 64, sp, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 64, ps, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 64, pp, 15 + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_48x64_avx2 2 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_48x64, 5, 7, %2 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, 32 + +%ifidn %1, pp + vbroadcasti128 m7, [INTERP_OFFSET_PP] +%elifidn %1, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%elifidn %1, ps + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + +.loopH: +%assign x 0 +%rep 3 + movu m1, [r0 + x] + movu m3, [r0 + r1 + x] + movu m5, [r5 + 0 * mmsize] + punpcklwd m0, m1, m3 + pmaddwd m0, m5 + punpckhwd m1, m3 + pmaddwd m1, m5 + + movu m4, [r0 + 2 * r1 + x] + punpcklwd m2, m3, m4 + pmaddwd m2, m5 + punpckhwd m3, m4 + pmaddwd m3, m5 + + lea r6, [r0 + 2 * r1] + movu m5, [r6 + r1 + x] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m1, m4 + + movu m4, [r6 + 2 * r1 + x] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, 
m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + +%ifidn %1, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %1, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 +%ifidn %1, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] +%endif + + movu [r2 + x], m0 + movu [r2 + r3 + x], m2 +%assign x x+mmsize +%endrep + + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loopH + RET +%endmacro + + FILTER_VER_CHROMA_W16_48x64_avx2 pp, 8 + FILTER_VER_CHROMA_W16_48x64_avx2 ps, 8 + FILTER_VER_CHROMA_W16_48x64_avx2 ss, 7 + FILTER_VER_CHROMA_W16_48x64_avx2 sp, 8 + +INIT_XMM sse2 +cglobal chroma_p2s, 3, 7, 3 + ; load width and height + mov r3d, r3m + mov r4d, r4m + add r1, r1 + + ; load constant + mova m2, [tab_c_n8192] + +.loopH: + + xor r5d, r5d +.loopW: + lea r6, [r0 + r5 * 2] + + movu m0, [r6] + psllw m0, (14 - BIT_DEPTH) + paddw m0, m2 + + movu m1, [r6 + r1] + psllw m1, (14 - BIT_DEPTH) + paddw m1, m2 + + add r5d, 8 + cmp r5d, r3d + lea r6, [r2 + r5 * 2] + jg .width4 + movu [r6 + FENC_STRIDE / 2 * 0 - 16], m0 + movu [r6 + FENC_STRIDE / 2 * 2 - 16], m1 + je .nextH + jmp .loopW + +.width4: + test r3d, 4 + jz .width2 + test r3d, 2 + movh [r6 + FENC_STRIDE / 2 * 0 - 16], m0 + movh [r6 + FENC_STRIDE / 2 * 2 - 16], m1 + lea r6, [r6 + 8] + pshufd m0, m0, 2 + pshufd m1, m1, 2 + jz .nextH + +.width2: + movd [r6 + FENC_STRIDE / 2 * 0 - 16], m0 + movd [r6 + FENC_STRIDE / 2 * 2 - 16], m1 + +.nextH: 
+ lea r0, [r0 + r1 * 2] + add r2, FENC_STRIDE / 2 * 4 + + sub r4d, 2 + jnz .loopH + RET %macro PROCESS_LUMA_VER_W4_4R 0 movq m0, [r0] @@ -3517,7 +9103,7 @@ ; load constant mova m2, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] psllw m0, (14 - BIT_DEPTH) @@ -3570,7 +9156,7 @@ ; load constant mova m1, [pw_2000] -.loop: +.loop movu m0, [r0] psllw m0, (14 - BIT_DEPTH) psubw m0, m1 @@ -3691,7 +9277,7 @@ ; load constant mova m2, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] psllw m0, (14 - BIT_DEPTH) @@ -3765,7 +9351,7 @@ ; load constant mova m2, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] psllw m0, (14 - BIT_DEPTH) @@ -3819,7 +9405,7 @@ ; load constant mova m4, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] @@ -3924,7 +9510,7 @@ ; load constant mova m2, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] psllw m0, (14 - BIT_DEPTH) @@ -3997,7 +9583,7 @@ ; load constant mova m4, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] @@ -4172,7 +9758,7 @@ ; load constant mova m2, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] psllw m0, (14 - BIT_DEPTH) @@ -4283,7 +9869,7 @@ ; load constant mova m4, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] @@ -4366,7 +9952,7 @@ ; load constant mova m2, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + 32] psllw m0, (14 - BIT_DEPTH) @@ -4431,7 +10017,7 @@ ; load constant mova m2, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] psllw m0, (14 - BIT_DEPTH) @@ -4495,7 +10081,7 @@ ; load constant mova m4, [pw_2000] -.loop: +.loop movu m0, [r0] movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] @@ -4611,3 +10197,2811 @@ jnz .loop RET +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- 
+INIT_YMM avx2 +cglobal filterPixelToShort_48x64, 3, 7, 4 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, 16 + + ; load constant + mova m3, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + 32] + movu m2, [r0 + 64] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m3 + psubw m1, m3 + psubw m2, m3 + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 0 + 32], m1 + movu [r2 + r3 * 0 + 64], m2 + + movu m0, [r0 + r1] + movu m1, [r0 + r1 + 32] + movu m2, [r0 + r1 + 64] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m3 + psubw m1, m3 + psubw m2, m3 + movu [r2 + r3 * 1], m0 + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 1 + 64], m2 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r1 * 2 + 32] + movu m2, [r0 + r1 * 2 + 64] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m3 + psubw m1, m3 + psubw m2, m3 + movu [r2 + r3 * 2], m0 + movu [r2 + r3 * 2 + 32], m1 + movu [r2 + r3 * 2 + 64], m2 + + movu m0, [r0 + r5] + movu m1, [r0 + r5 + 32] + movu m2, [r0 + r5 + 64] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m3 + psubw m1, m3 + psubw m2, m3 + movu [r2 + r4], m0 + movu [r2 + r4 + 32], m1 + movu [r2 + r4 + 64], m2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET + + +;----------------------------------------------------------------------------------------------------------------------------- +;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- + +%macro IPFILTER_LUMA_PS_4xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_4x%1, 6,8,7 + mov r5d, r5m + mov r4d, r4m + add r1d, 
r1d + add r3d, r3d + +%ifdef PIC + lea r6, [tab_LumaCoeff] + lea r4, [r4 * 8] + vbroadcasti128 m0, [r6 + r4 * 2] +%else + lea r4, [r4 * 8] + vbroadcasti128 m0, [tab_LumaCoeff + r4 * 2] +%endif + + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - pw_2000 + + sub r0, 6 + test r5d, r5d + mov r7d, %1 ; loop count variable - height + jz .preloop + lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride + sub r0, r6 ; r0(src) - 3 * srcStride + add r7d, 6 ;7 - 1(since last row not in loop) ; need extra 7 rows, just set a specially flag here, blkheight += N - 1 (7 - 3 = 4 ; since the last three rows not in loop) + +.preloop: + lea r6, [r3 * 3] +.loop + ; Row 0 + movu xm3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + movu xm4, [r0 + 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + vinserti128 m3, m3, xm4, 1 + movu xm4, [r0 + 4] + movu xm5, [r0 + 6] + vinserti128 m4, m4, xm5, 1 + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + ; Row 1 + movu xm4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + movu xm5, [r0 + r1 + 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + vinserti128 m4, m4, xm5, 1 + movu xm5, [r0 + r1 + 4] + movu xm6, [r0 + r1 + 6] + vinserti128 m5, m5, xm6, 1 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + phaddd m3, m4 ; all rows and col completed. 
+ + mova m5, [interp8_hps_shuf] + vpermd m3, m5, m3 + paddd m3, m2 + vextracti128 xm4, m3, 1 + psrad xm3, INTERP_SHIFT_PS + psrad xm4, INTERP_SHIFT_PS + packssdw xm3, xm3 + packssdw xm4, xm4 + + movq [r2], xm3 ;row 0 + movq [r2 + r3], xm4 ;row 1 + lea r0, [r0 + r1 * 2] ; first loop src ->5th row(i.e 4) + lea r2, [r2 + r3 * 2] ; first loop dst ->5th row(i.e 4) + + sub r7d, 2 + jg .loop + test r5d, r5d + jz .end + + ; Row 10 + movu xm3, [r0] + movu xm4, [r0 + 2] + vinserti128 m3, m3, xm4, 1 + movu xm4, [r0 + 4] + movu xm5, [r0 + 6] + vinserti128 m4, m4, xm5, 1 + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + + ; Row11 + phaddd m3, m4 ; all rows and col completed. + + mova m5, [interp8_hps_shuf] + vpermd m3, m5, m3 + paddd m3, m2 + vextracti128 xm4, m3, 1 + psrad xm3, INTERP_SHIFT_PS + psrad xm4, INTERP_SHIFT_PS + packssdw xm3, xm3 + packssdw xm4, xm4 + + movq [r2], xm3 ;row 0 +.end + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_4xN_AVX2 4 + IPFILTER_LUMA_PS_4xN_AVX2 8 + IPFILTER_LUMA_PS_4xN_AVX2 16 + +%macro IPFILTER_LUMA_PS_8xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_8x%1, 4, 6, 8 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, %1 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5,m4, 1 + psrad xm4, INTERP_SHIFT_PS 
+ psrad xm5, INTERP_SHIFT_PS + packssdw xm4, xm5 + + movu [r2], xm4 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_8xN_AVX2 4 + IPFILTER_LUMA_PS_8xN_AVX2 8 + IPFILTER_LUMA_PS_8xN_AVX2 16 + IPFILTER_LUMA_PS_8xN_AVX2 32 + +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_24x32, 4, 6, 8 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, 32 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: +%assign x 0 +%rep 24/8 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 8 + x] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m5, m3 + pshufb m6, m3 + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5,m4, 1 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS + packssdw xm4, xm5 + + movu [r2 + x], xm4 + %assign x x+16 + %endrep + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif + + +%macro IPFILTER_LUMA_PS_32_64_AVX2 2 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_%1x%2, 4, 6, 8 + + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 6 +%ifdef PIC + lea r6, [tab_LumaCoeffV] + movu m0, [r6 + r4] + movu m1, [r6 + r4 + mmsize] +%else + movu m0, [tab_LumaCoeffV + r4] + movu m1, [tab_LumaCoeffV + r4 + mmsize] +%endif + mova m3, [interp8_hpp_shuf_new] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, %2 + jz .loop0 + lea r6, [r1*3] + sub 
r0, r6 + add r4d, 7 + +.loop0: +%assign x 0 +%rep %1/16 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 4 * SIZEOF_PIXEL + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m7, m5, m1 + paddd m4, m7 + vextracti128 xm7, m4, 1 + paddd xm4, xm7 + paddd xm4, xm2 + psrad xm4, INTERP_SHIFT_PS + + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m7, m6, m1 + paddd m5, m7 + vextracti128 xm7, m5, 1 + paddd xm5, xm7 + paddd xm5, xm2 + psrad xm5, INTERP_SHIFT_PS + + packssdw xm4, xm5 + movu [r2 + x], xm4 + + vbroadcasti128 m5, [r0 + 24 + x] + pshufb m5, m3 + + pmaddwd m6, m0 + pmaddwd m7, m5, m1 + paddd m6, m7 + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + paddd xm6, xm2 + psrad xm6, INTERP_SHIFT_PS + + vbroadcasti128 m7, [r0 + 32 + x] + pshufb m7, m3 + + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + vextracti128 xm7, m5, 1 + paddd xm5, xm7 + paddd xm5, xm2 + psrad xm5, INTERP_SHIFT_PS + + packssdw xm6, xm5 + movu [r2 + 16 + x], xm6 + +%assign x x+32 +%endrep + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_32_64_AVX2 32, 8 + IPFILTER_LUMA_PS_32_64_AVX2 32, 16 + IPFILTER_LUMA_PS_32_64_AVX2 32, 24 + IPFILTER_LUMA_PS_32_64_AVX2 32, 32 + IPFILTER_LUMA_PS_32_64_AVX2 32, 64 + + IPFILTER_LUMA_PS_32_64_AVX2 64, 16 + IPFILTER_LUMA_PS_32_64_AVX2 64, 32 + IPFILTER_LUMA_PS_32_64_AVX2 64, 48 + IPFILTER_LUMA_PS_32_64_AVX2 64, 64 + + IPFILTER_LUMA_PS_32_64_AVX2 48, 64 + +%macro IPFILTER_LUMA_PS_16xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_16x%1, 4, 6, 8 + + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + 
test r5d, r5d + mov r4d, %1 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m7, m6, m3 + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5, m4, 1 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m5, [r0 + 24] + pshufb m6, m3 + pshufb m7, m5, m3 + pmaddwd m6, m0 + pmaddwd m7, m1 + paddd m6, m7 + + vbroadcasti128 m7, [r0 + 32] + pshufb m5, m3 + pshufb m7, m3 + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + + phaddd m6, m5 + vpermq m6, m6, q3120 + paddd m6, m2 + vextracti128 xm5,m6, 1 + psrad xm6, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS + packssdw xm6, xm5 + movu [r2 + 16], xm6 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_16xN_AVX2 4 + IPFILTER_LUMA_PS_16xN_AVX2 8 + IPFILTER_LUMA_PS_16xN_AVX2 12 + IPFILTER_LUMA_PS_16xN_AVX2 16 + IPFILTER_LUMA_PS_16xN_AVX2 32 + IPFILTER_LUMA_PS_16xN_AVX2 64 + +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_12x16, 4, 6, 8 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, 16 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m7, 
m6, m3 + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5,m4, 1 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m5, [r0 + 24] + pshufb m6, m3 + pshufb m5, m3 + pmaddwd m6, m0 + pmaddwd m5, m1 + paddd m6, m5 + + phaddd m6, m6 + vpermq m6, m6, q3120 + paddd xm6, xm2 + psrad xm6, INTERP_SHIFT_PS + packssdw xm6, xm6 + movq [r2 + 16], xm6 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif + +%macro IPFILTER_CHROMA_PS_8xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_8x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_CHROMA_PS_8xN_AVX2 4 + IPFILTER_CHROMA_PS_8xN_AVX2 8 + IPFILTER_CHROMA_PS_8xN_AVX2 16 + IPFILTER_CHROMA_PS_8xN_AVX2 32 + IPFILTER_CHROMA_PS_8xN_AVX2 6 + IPFILTER_CHROMA_PS_8xN_AVX2 2 + IPFILTER_CHROMA_PS_8xN_AVX2 12 + IPFILTER_CHROMA_PS_8xN_AVX2 64 + +%macro IPFILTER_CHROMA_PS_16xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_16x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova 
m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_16xN_AVX2 16 +IPFILTER_CHROMA_PS_16xN_AVX2 8 +IPFILTER_CHROMA_PS_16xN_AVX2 32 +IPFILTER_CHROMA_PS_16xN_AVX2 12 +IPFILTER_CHROMA_PS_16xN_AVX2 4 +IPFILTER_CHROMA_PS_16xN_AVX2 64 +IPFILTER_CHROMA_PS_16xN_AVX2 24 + +%macro IPFILTER_CHROMA_PS_24xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_24x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, 
m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 32], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_24xN_AVX2 32 +IPFILTER_CHROMA_PS_24xN_AVX2 64 + +%macro IPFILTER_CHROMA_PS_12xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_12x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m3 + pmaddwd m4, m0 + phaddd m4, m4 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movq [r2 + 16], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_12xN_AVX2 16 +IPFILTER_CHROMA_PS_12xN_AVX2 32 + +%macro IPFILTER_CHROMA_PS_32xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_32x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq 
m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 32], xm4 + + vbroadcasti128 m4, [r0 + 48] + vbroadcasti128 m5, [r0 + 56] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 48], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_32xN_AVX2 32 +IPFILTER_CHROMA_PS_32xN_AVX2 16 +IPFILTER_CHROMA_PS_32xN_AVX2 24 +IPFILTER_CHROMA_PS_32xN_AVX2 8 +IPFILTER_CHROMA_PS_32xN_AVX2 64 +IPFILTER_CHROMA_PS_32xN_AVX2 48 + + +%macro IPFILTER_CHROMA_PS_64xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_64x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [interp8_hpp_shuf] + 
vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 32], xm4 + + vbroadcasti128 m4, [r0 + 48] + vbroadcasti128 m5, [r0 + 56] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 48], xm4 + + vbroadcasti128 m4, [r0 + 64] + vbroadcasti128 m5, [r0 + 72] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 64], xm4 + + vbroadcasti128 m4, [r0 + 80] + vbroadcasti128 m5, [r0 + 88] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 80], xm4 + + vbroadcasti128 m4, [r0 + 96] + vbroadcasti128 m5, [r0 + 104] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + 
paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 96], xm4 + + vbroadcasti128 m4, [r0 + 112] + vbroadcasti128 m5, [r0 + 120] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 112], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_64xN_AVX2 64 +IPFILTER_CHROMA_PS_64xN_AVX2 48 +IPFILTER_CHROMA_PS_64xN_AVX2 32 +IPFILTER_CHROMA_PS_64xN_AVX2 16 + +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_48x64, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, 64 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 32], xm4 + + vbroadcasti128 m4, [r0 + 48] + vbroadcasti128 m5, 
[r0 + 56] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 48], xm4 + + vbroadcasti128 m4, [r0 + 64] + vbroadcasti128 m5, [r0 + 72] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 64], xm4 + + vbroadcasti128 m4, [r0 + 80] + vbroadcasti128 m5, [r0 + 88] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 80], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif + +%macro IPFILTER_CHROMA_PS_6xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_6x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [interp8_hpp_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, INTERP_SHIFT_PS + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movq [r2], xm4 + pextrd [r2 + 8], xm4, 2 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_CHROMA_PS_6xN_AVX2 8 + IPFILTER_CHROMA_PS_6xN_AVX2 16 + +%macro FILTER_VER_CHROMA_AVX2_8xN 2 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x%2, 4, 9, 15 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + 
+%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + vbroadcasti128 m14, [pd_32] +%elifidn %1, sp + vbroadcasti128 m14, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m14, [INTERP_OFFSET_PS] +%endif + lea r6, [r3 * 3] + lea r7, [r1 * 4] + mov r8d, %2 / 16 +.loopH: + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + paddd m3, m7 + pmaddwd m5, [r5] + + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + paddd m4, m8 + pmaddwd m6, [r5] + + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhwd xm9, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddwd m9, m7, [r5 + 1 * mmsize] + paddd m5, m9 + pmaddwd m7, [r5] + + + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhwd xm10, xm8, xm9 + punpcklwd xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddwd m10, m8, [r5 + 1 * mmsize] + paddd m6, m10 + pmaddwd m8, [r5] + + + 
movu xm10, [r0 + r1 * 2] ; m10 = row 10 + punpckhwd xm11, xm9, xm10 + punpcklwd xm9, xm10 + vinserti128 m9, m9, xm11, 1 + pmaddwd m11, m9, [r5 + 1 * mmsize] + paddd m7, m11 + pmaddwd m9, [r5] + + movu xm11, [r0 + r4] ; m11 = row 11 + punpckhwd xm12, xm10, xm11 + punpcklwd xm10, xm11 + vinserti128 m10, m10, xm12, 1 + pmaddwd m12, m10, [r5 + 1 * mmsize] + paddd m8, m12 + pmaddwd m10, [r5] + + lea r0, [r0 + r1 * 4] + movu xm12, [r0] ; m12 = row 12 + punpckhwd xm13, xm11, xm12 + punpcklwd xm11, xm12 + vinserti128 m11, m11, xm13, 1 + pmaddwd m13, m11, [r5 + 1 * mmsize] + paddd m9, m13 + pmaddwd m11, [r5] + +%ifidn %1,ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + psrad m4, 6 + psrad m5, 6 +%else + paddd m0, m14 + paddd m1, m14 + paddd m2, m14 + paddd m3, m14 + paddd m4, m14 + paddd m5, m14 +%ifidn %1,pp + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + psrad m4, 6 + psrad m5, 6 +%elifidn %1, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m4, INTERP_SHIFT_SP + psrad m5, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS +%endif +%endif + + packssdw m0, m1 + packssdw m2, m3 + packssdw m4, m5 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + vpermq m4, m4, q3120 + pxor m5, m5 + mova m3, [pw_pixel_max] +%ifidn %1,pp + CLIPW m0, m5, m3 + CLIPW m2, m5, m3 + CLIPW m4, m5, m3 +%elifidn %1, sp + CLIPW m0, m5, m3 + CLIPW m2, m5, m3 + CLIPW m4, m5, m3 +%endif + + vextracti128 xm1, m0, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + vextracti128 xm1, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm1 + lea r2, [r2 + r3 * 4] + vextracti128 xm1, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm1 + + movu xm13, [r0 + r1] ; m13 = row 13 + punpckhwd xm0, xm12, xm13 + punpcklwd xm12, xm13 + vinserti128 m12, m12, xm0, 1 + pmaddwd m0, m12, [r5 + 1 * mmsize] + paddd 
m10, m0 + pmaddwd m12, [r5] + + movu xm0, [r0 + r1 * 2] ; m0 = row 14 + punpckhwd xm1, xm13, xm0 + punpcklwd xm13, xm0 + vinserti128 m13, m13, xm1, 1 + pmaddwd m1, m13, [r5 + 1 * mmsize] + paddd m11, m1 + pmaddwd m13, [r5] + +%ifidn %1,ss + psrad m6, 6 + psrad m7, 6 +%else + paddd m6, m14 + paddd m7, m14 +%ifidn %1,pp + psrad m6, 6 + psrad m7, 6 +%elifidn %1, sp + psrad m6, INTERP_SHIFT_SP + psrad m7, INTERP_SHIFT_SP +%else + psrad m6, INTERP_SHIFT_PS + psrad m7, INTERP_SHIFT_PS +%endif +%endif + + packssdw m6, m7 + vpermq m6, m6, q3120 +%ifidn %1,pp + CLIPW m6, m5, m3 +%elifidn %1, sp + CLIPW m6, m5, m3 +%endif + vextracti128 xm7, m6, 1 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm7 + + movu xm1, [r0 + r4] ; m1 = row 15 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m2, m0, [r5 + 1 * mmsize] + paddd m12, m2 + pmaddwd m0, [r5] + + lea r0, [r0 + r1 * 4] + movu xm2, [r0] ; m2 = row 16 + punpckhwd xm6, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm6, 1 + pmaddwd m6, m1, [r5 + 1 * mmsize] + paddd m13, m6 + pmaddwd m1, [r5] + + movu xm6, [r0 + r1] ; m6 = row 17 + punpckhwd xm4, xm2, xm6 + punpcklwd xm2, xm6 + vinserti128 m2, m2, xm4, 1 + pmaddwd m2, [r5 + 1 * mmsize] + paddd m0, m2 + + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhwd xm2, xm6, xm4 + punpcklwd xm6, xm4 + vinserti128 m6, m6, xm2, 1 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m1, m6 + +%ifidn %1,ss + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + psrad m12, 6 + psrad m13, 6 + psrad m0, 6 + psrad m1, 6 +%else + paddd m8, m14 + paddd m9, m14 + paddd m10, m14 + paddd m11, m14 + paddd m12, m14 + paddd m13, m14 + paddd m0, m14 + paddd m1, m14 +%ifidn %1,pp + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + psrad m12, 6 + psrad m13, 6 + psrad m0, 6 + psrad m1, 6 +%elifidn %1, sp + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP + psrad m12, INTERP_SHIFT_SP + psrad m13, INTERP_SHIFT_SP 
+ psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP +%else + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + psrad m12, INTERP_SHIFT_PS + psrad m13, INTERP_SHIFT_PS + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS +%endif +%endif + + packssdw m8, m9 + packssdw m10, m11 + packssdw m12, m13 + packssdw m0, m1 + vpermq m8, m8, q3120 + vpermq m10, m10, q3120 + vpermq m12, m12, q3120 + vpermq m0, m0, q3120 +%ifidn %1,pp + CLIPW m8, m5, m3 + CLIPW m10, m5, m3 + CLIPW m12, m5, m3 + CLIPW m0, m5, m3 +%elifidn %1, sp + CLIPW m8, m5, m3 + CLIPW m10, m5, m3 + CLIPW m12, m5, m3 + CLIPW m0, m5, m3 +%endif + vextracti128 xm9, m8, 1 + vextracti128 xm11, m10, 1 + vextracti128 xm13, m12, 1 + vextracti128 xm1, m0, 1 + lea r2, [r2 + r3 * 4] + movu [r2], xm8 + movu [r2 + r3], xm9 + movu [r2 + r3 * 2], xm10 + movu [r2 + r6], xm11 + lea r2, [r2 + r3 * 4] + movu [r2], xm12 + movu [r2 + r3], xm13 + movu [r2 + r3 * 2], xm0 + movu [r2 + r6], xm1 + lea r2, [r2 + r3 * 4] + dec r8d + jnz .loopH + RET +%endif +%endmacro + +FILTER_VER_CHROMA_AVX2_8xN pp, 16 +FILTER_VER_CHROMA_AVX2_8xN ps, 16 +FILTER_VER_CHROMA_AVX2_8xN ss, 16 +FILTER_VER_CHROMA_AVX2_8xN sp, 16 +FILTER_VER_CHROMA_AVX2_8xN pp, 32 +FILTER_VER_CHROMA_AVX2_8xN ps, 32 +FILTER_VER_CHROMA_AVX2_8xN sp, 32 +FILTER_VER_CHROMA_AVX2_8xN ss, 32 +FILTER_VER_CHROMA_AVX2_8xN pp, 64 +FILTER_VER_CHROMA_AVX2_8xN ps, 64 +FILTER_VER_CHROMA_AVX2_8xN sp, 64 +FILTER_VER_CHROMA_AVX2_8xN ss, 64 + +%macro PROCESS_CHROMA_AVX2_8x2 3 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m2, m2, [r5 + 1 * mmsize] + paddd m0, m2 
+ + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m3, m3, [r5 + 1 * mmsize] + paddd m1, m3 + +%ifnidn %1,ss + paddd m0, m7 + paddd m1, m7 +%endif + psrad m0, %3 + psrad m1, %3 + + packssdw m0, m1 + vpermq m0, m0, q3120 + pxor m4, m4 + +%if %2 + CLIPW m0, m4, [pw_pixel_max] +%endif + vextracti128 xm1, m0, 1 +%endmacro + + +%macro FILTER_VER_CHROMA_AVX2_8x2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x2, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + vbroadcasti128 m7, [pd_32] +%elifidn %1, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + + PROCESS_CHROMA_AVX2_8x2 %1, %2, %3 + movu [r2], xm0 + movu [r2 + r3], xm1 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_8x2 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x2 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x2 sp, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_8x2 ss, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_4x2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x2, 4, 6, 7 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m6, [pd_32] +%elifidn %1, sp + vbroadcasti128 m6, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m6, [INTERP_OFFSET_PS] +%endif + + movq xm0, [r0] ; row 0 + movq xm1, [r0 + r1] ; row 1 + punpcklwd xm0, xm1 + + movq xm2, [r0 + r1 * 2] ; row 2 + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + + movq xm3, [r0 + r4] ; row 3 + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] ; row 4 + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + 
paddd m0, m5 + +%ifnidn %1, ss + paddd m0, m6 +%endif + psrad m0, %3 + packssdw m0, m0 + pxor m1, m1 + +%if %2 + CLIPW m0, m1, [pw_pixel_max] +%endif + + vextracti128 xm2, m0, 1 + lea r4, [r3 * 3] + movq [r2], xm0 + movq [r2 + r3], xm2 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_4x2 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_4x2 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4x2 sp, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_4x2 ss, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_4x4 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x4, 4, 6, 7 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m6, [pd_32] +%elifidn %1, sp + vbroadcasti128 m6, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m6, [INTERP_OFFSET_PS] +%endif + movq xm0, [r0] ; row 0 + movq xm1, [r0 + r1] ; row 1 + punpcklwd xm0, xm1 + + movq xm2, [r0 + r1 * 2] ; row 2 + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + + movq xm3, [r0 + r4] ; row 3 + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] ; row 4 + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + + movq xm3, [r0 + r1] ; row 5 + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] ; row 6 + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m4, [r5 + 1 * mmsize] + paddd m2, m4 + +%ifnidn %1,ss + paddd m0, m6 + paddd m2, m6 +%endif + psrad m0, %3 + psrad m2, %3 + + packssdw m0, m2 + pxor m1, m1 +%if %2 + CLIPW m0, m1, [pw_pixel_max] +%endif + + vextracti128 xm2, m0, 1 + lea r4, [r3 * 3] + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r4], xm2 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_4x4 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_4x4 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4x4 sp, 1, INTERP_SHIFT_SP 
+FILTER_VER_CHROMA_AVX2_4x4 ss, 0, 6 + + +%macro FILTER_VER_CHROMA_AVX2_4x8 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x8, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m7, [pd_32] +%elifidn %1, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + lea r6, [r3 * 3] + + movq xm0, [r0] ; row 0 + movq xm1, [r0 + r1] ; row 1 + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] ; row 2 + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + + movq xm3, [r0 + r4] ; row 3 + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] ; row 4 + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + + movq xm3, [r0 + r1] ; row 5 + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] ; row 6 + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m5, m4, [r5 + 1 * mmsize] + paddd m2, m5 + pmaddwd m4, [r5] + + movq xm3, [r0 + r4] ; row 7 + punpcklwd xm1, xm3 + lea r0, [r0 + 4 * r1] + movq xm6, [r0] ; row 8 + punpcklwd xm3, xm6 + vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] + pmaddwd m5, m1, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m1, [r5] + + movq xm3, [r0 + r1] ; row 9 + punpcklwd xm6, xm3 + movq xm5, [r0 + 2 * r1] ; row 10 + punpcklwd xm3, xm5 + vinserti128 m6, m6, xm3, 1 ; m6 = [A 9 9 8] + pmaddwd m6, [r5 + 1 * mmsize] + paddd m1, m6 +%ifnidn %1,ss + paddd m0, m7 + paddd m2, m7 +%endif + psrad m0, %3 + psrad m2, %3 + packssdw m0, m2 + pxor m6, m6 + mova m3, [pw_pixel_max] +%if %2 + CLIPW m0, m6, m3 +%endif + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm2 +%ifnidn %1,ss + paddd m4, m7 + paddd m1, m7 +%endif + psrad m4, %3 + psrad m1, %3 + 
packssdw m4, m1 +%if %2 + CLIPW m4, m6, m3 +%endif + vextracti128 xm1, m4, 1 + lea r2, [r2 + r3 * 4] + movq [r2], xm4 + movq [r2 + r3], xm1 + movhps [r2 + r3 * 2], xm4 + movhps [r2 + r6], xm1 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_4x8 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_4x8 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4x8 sp, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_4x8 ss, 0 , 6 + +%macro PROCESS_LUMA_AVX2_W4_16R_4TAP 3 + movq xm0, [r0] ; row 0 + movq xm1, [r0 + r1] ; row 1 + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] ; row 2 + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + movq xm3, [r0 + r4] ; row 3 + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] ; row 4 + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + movq xm3, [r0 + r1] ; row 5 + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] ; row 6 + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m5, m4, [r5 + 1 * mmsize] + paddd m2, m5 + pmaddwd m4, [r5] + movq xm3, [r0 + r4] ; row 7 + punpcklwd xm1, xm3 + lea r0, [r0 + 4 * r1] + movq xm6, [r0] ; row 8 + punpcklwd xm3, xm6 + vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] + pmaddwd m5, m1, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m1, [r5] + movq xm3, [r0 + r1] ; row 9 + punpcklwd xm6, xm3 + movq xm5, [r0 + 2 * r1] ; row 10 + punpcklwd xm3, xm5 + vinserti128 m6, m6, xm3, 1 ; m6 = [10 9 9 8] + pmaddwd m3, m6, [r5 + 1 * mmsize] + paddd m1, m3 + pmaddwd m6, [r5] +%ifnidn %1,ss + paddd m0, m7 + paddd m2, m7 +%endif + psrad m0, %3 + psrad m2, %3 + packssdw m0, m2 + pxor m3, m3 +%if %2 + CLIPW m0, m3, [pw_pixel_max] +%endif + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm2 + movq xm2, [r0 + r4] ;row 11 + punpcklwd xm5, xm2 + lea r0, [r0 + 4 * r1] + movq xm0, [r0] ; row 12 + punpcklwd xm2, xm0 + vinserti128 m5, m5, xm2, 1 ; m5 = [12 11 11 10] + 
pmaddwd m2, m5, [r5 + 1 * mmsize] + paddd m6, m2 + pmaddwd m5, [r5] + movq xm2, [r0 + r1] ; row 13 + punpcklwd xm0, xm2 + movq xm3, [r0 + 2 * r1] ; row 14 + punpcklwd xm2, xm3 + vinserti128 m0, m0, xm2, 1 ; m0 = [14 13 13 12] + pmaddwd m2, m0, [r5 + 1 * mmsize] + paddd m5, m2 + pmaddwd m0, [r5] +%ifnidn %1,ss + paddd m4, m7 + paddd m1, m7 +%endif + psrad m4, %3 + psrad m1, %3 + packssdw m4, m1 + pxor m2, m2 +%if %2 + CLIPW m4, m2, [pw_pixel_max] +%endif + + vextracti128 xm1, m4, 1 + lea r2, [r2 + r3 * 4] + movq [r2], xm4 + movq [r2 + r3], xm1 + movhps [r2 + r3 * 2], xm4 + movhps [r2 + r6], xm1 + movq xm4, [r0 + r4] ; row 15 + punpcklwd xm3, xm4 + lea r0, [r0 + 4 * r1] + movq xm1, [r0] ; row 16 + punpcklwd xm4, xm1 + vinserti128 m3, m3, xm4, 1 ; m3 = [16 15 15 14] + pmaddwd m4, m3, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m3, [r5] + movq xm4, [r0 + r1] ; row 17 + punpcklwd xm1, xm4 + movq xm2, [r0 + 2 * r1] ; row 18 + punpcklwd xm4, xm2 + vinserti128 m1, m1, xm4, 1 ; m1 = [18 17 17 16] + pmaddwd m1, [r5 + 1 * mmsize] + paddd m3, m1 + +%ifnidn %1,ss + paddd m6, m7 + paddd m5, m7 +%endif + psrad m6, %3 + psrad m5, %3 + packssdw m6, m5 + pxor m1, m1 +%if %2 + CLIPW m6, m1, [pw_pixel_max] +%endif + vextracti128 xm5, m6, 1 + lea r2, [r2 + r3 * 4] + movq [r2], xm6 + movq [r2 + r3], xm5 + movhps [r2 + r3 * 2], xm6 + movhps [r2 + r6], xm5 +%ifnidn %1,ss + paddd m0, m7 + paddd m3, m7 +%endif + psrad m0, %3 + psrad m3, %3 + packssdw m0, m3 +%if %2 + CLIPW m0, m1, [pw_pixel_max] +%endif + vextracti128 xm3, m0, 1 + lea r2, [r2 + r3 * 4] + movq [r2], xm0 + movq [r2 + r3], xm3 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm3 +%endmacro + + +%macro FILTER_VER_CHROMA_AVX2_4xN 4 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x%2, 4, 8, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + mov r7d, %2 / 16 +%ifidn %1,pp + 
vbroadcasti128 m7, [pd_32] +%elifidn %1, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + lea r6, [r3 * 3] +.loopH: + PROCESS_LUMA_AVX2_W4_16R_4TAP %1, %3, %4 + lea r2, [r2 + r3 * 4] + dec r7d + jnz .loopH + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_4xN pp, 16, 1, 6 +FILTER_VER_CHROMA_AVX2_4xN ps, 16, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4xN sp, 16, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_4xN ss, 16, 0, 6 +FILTER_VER_CHROMA_AVX2_4xN pp, 32, 1, 6 +FILTER_VER_CHROMA_AVX2_4xN ps, 32, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4xN sp, 32, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_4xN ss, 32, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_8x8 3 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x8, 4, 6, 12 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m11, [pd_32] +%elifidn %1, sp + vbroadcasti128 m11, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m11, [INTERP_OFFSET_PS] +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 ; res row0 done(0,1,2,3) + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + paddd m1, m5 ;res row1 done(1, 2, 3, 4) + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, 
[r5 + 1 * mmsize] + pmaddwd m4, [r5] + paddd m2, m6 ;res row2 done(2,3,4,5) + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m7 ;res row3 done(3,4,5,6) + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m8 ;res row4 done(4,5,6,7) + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhwd xm9, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddwd m9, m7, [r5 + 1 * mmsize] + pmaddwd m7, [r5] + paddd m5, m9 ;res row5 done(5,6,7,8) + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhwd xm10, xm8, xm9 + punpcklwd xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddwd m8, [r5 + 1 * mmsize] + paddd m6, m8 ;res row6 done(6,7,8,9) + movu xm10, [r0 + r1 * 2] ; m10 = row 10 + punpckhwd xm8, xm9, xm10 + punpcklwd xm9, xm10 + vinserti128 m9, m9, xm8, 1 + pmaddwd m9, [r5 + 1 * mmsize] + paddd m7, m9 ;res row7 done 7,8,9,10 + lea r4, [r3 * 3] +%ifnidn %1,ss + paddd m0, m11 + paddd m1, m11 + paddd m2, m11 + paddd m3, m11 +%endif + psrad m0, %3 + psrad m1, %3 + psrad m2, %3 + psrad m3, %3 + packssdw m0, m1 + packssdw m2, m3 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + pxor m1, m1 + mova m3, [pw_pixel_max] +%if %2 + CLIPW m0, m1, m3 + CLIPW m2, m1, m3 +%endif + vextracti128 xm9, m0, 1 + vextracti128 xm8, m2, 1 + movu [r2], xm0 + movu [r2 + r3], xm9 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm8 +%ifnidn %1,ss + paddd m4, m11 + paddd m5, m11 + paddd m6, m11 + paddd m7, m11 +%endif + psrad m4, %3 + psrad m5, %3 + psrad m6, %3 + psrad m7, %3 + packssdw m4, m5 + packssdw m6, m7 + vpermq m4, m4, q3120 + vpermq m6, m6, q3120 +%if %2 + CLIPW m4, m1, m3 + CLIPW m6, m1, m3 +%endif + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + lea r2, [r2 + r3 * 4] + movu [r2], xm4 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm6 + 
movu [r2 + r4], xm7 + RET +%endif +%endmacro + +FILTER_VER_CHROMA_AVX2_8x8 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x8 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x8 sp, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_8x8 ss, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_8x6 3 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x6, 4, 6, 12 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m11, [pd_32] +%elifidn %1, sp + vbroadcasti128 m11, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m11, [INTERP_OFFSET_PS] +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 ; r0 done(0,1,2,3) + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + paddd m1, m5 ;r1 done(1, 2, 3, 4) + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + pmaddwd m4, [r5] + paddd m2, m6 ;r2 done(2,3,4,5) + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m7 ;r3 done(3,4,5,6) + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + paddd m4, m8 ;r4 
done(4,5,6,7) + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhwd xm9, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddwd m7, m7, [r5 + 1 * mmsize] + paddd m5, m7 ;r5 done(5,6,7,8) + lea r4, [r3 * 3] +%ifnidn %1,ss + paddd m0, m11 + paddd m1, m11 + paddd m2, m11 + paddd m3, m11 +%endif + psrad m0, %3 + psrad m1, %3 + psrad m2, %3 + psrad m3, %3 + packssdw m0, m1 + packssdw m2, m3 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + pxor m10, m10 + mova m9, [pw_pixel_max] +%if %2 + CLIPW m0, m10, m9 + CLIPW m2, m10, m9 +%endif + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm3 +%ifnidn %1,ss + paddd m4, m11 + paddd m5, m11 +%endif + psrad m4, %3 + psrad m5, %3 + packssdw m4, m5 + vpermq m4, m4, 11011000b +%if %2 + CLIPW m4, m10, m9 +%endif + vextracti128 xm5, m4, 1 + lea r2, [r2 + r3 * 4] + movu [r2], xm4 + movu [r2 + r3], xm5 + RET +%endif +%endmacro + +FILTER_VER_CHROMA_AVX2_8x6 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x6 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x6 sp, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_8x6 ss, 0, 6 + +%macro PROCESS_CHROMA_AVX2 3 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m4, [r5 + 1 * 
mmsize] + paddd m2, m4 + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm4, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm4, 1 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 +%ifnidn %1,ss + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 +%endif + psrad m0, %3 + psrad m1, %3 + psrad m2, %3 + psrad m3, %3 + packssdw m0, m1 + packssdw m2, m3 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + pxor m4, m4 +%if %2 + CLIPW m0, m4, [pw_pixel_max] + CLIPW m2, m4, [pw_pixel_max] +%endif + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 +%endmacro + + +%macro FILTER_VER_CHROMA_AVX2_8x4 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x4, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + vbroadcasti128 m7, [pd_32] +%elifidn %1, sp + vbroadcasti128 m7, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + PROCESS_CHROMA_AVX2 %1, %2, %3 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + lea r4, [r3 * 3] + movu [r2 + r4], xm3 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_8x4 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x4 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x4 sp, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_8x4 ss, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_8x12 3 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x12, 4, 7, 15 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + vbroadcasti128 m14, [pd_32] +%elifidn %1, sp + vbroadcasti128 m14, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m14, [INTERP_OFFSET_PS] +%endif + lea r6, [r3 * 3] + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd 
m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + paddd m3, m7 + pmaddwd m5, [r5] + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + paddd m4, m8 + pmaddwd m6, [r5] + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhwd xm9, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddwd m9, m7, [r5 + 1 * mmsize] + paddd m5, m9 + pmaddwd m7, [r5] + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhwd xm10, xm8, xm9 + punpcklwd xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddwd m10, m8, [r5 + 1 * mmsize] + paddd m6, m10 + pmaddwd m8, [r5] + movu xm10, [r0 + r1 * 2] ; m10 = row 10 + punpckhwd xm11, xm9, xm10 + punpcklwd xm9, xm10 + vinserti128 m9, m9, xm11, 1 + pmaddwd m11, m9, [r5 + 1 * mmsize] + paddd m7, m11 + pmaddwd m9, [r5] + movu xm11, [r0 + r4] ; m11 = row 11 + punpckhwd xm12, xm10, xm11 + punpcklwd xm10, xm11 + vinserti128 m10, m10, xm12, 1 + pmaddwd m12, m10, [r5 + 1 * mmsize] + paddd m8, m12 + pmaddwd m10, [r5] + lea r0, [r0 + r1 * 4] + movu xm12, [r0] ; m12 = row 12 + punpckhwd xm13, xm11, xm12 + punpcklwd xm11, xm12 + vinserti128 m11, m11, xm13, 1 + pmaddwd 
m13, m11, [r5 + 1 * mmsize] + paddd m9, m13 + pmaddwd m11, [r5] +%ifnidn %1,ss + paddd m0, m14 + paddd m1, m14 + paddd m2, m14 + paddd m3, m14 + paddd m4, m14 + paddd m5, m14 +%endif + psrad m0, %3 + psrad m1, %3 + psrad m2, %3 + psrad m3, %3 + psrad m4, %3 + psrad m5, %3 + packssdw m0, m1 + packssdw m2, m3 + packssdw m4, m5 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + vpermq m4, m4, q3120 + pxor m5, m5 + mova m3, [pw_pixel_max] +%if %2 + CLIPW m0, m5, m3 + CLIPW m2, m5, m3 + CLIPW m4, m5, m3 +%endif + vextracti128 xm1, m0, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + vextracti128 xm1, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm1 + lea r2, [r2 + r3 * 4] + vextracti128 xm1, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm1 + movu xm13, [r0 + r1] ; m13 = row 13 + punpckhwd xm0, xm12, xm13 + punpcklwd xm12, xm13 + vinserti128 m12, m12, xm0, 1 + pmaddwd m12, m12, [r5 + 1 * mmsize] + paddd m10, m12 + movu xm0, [r0 + r1 * 2] ; m0 = row 14 + punpckhwd xm1, xm13, xm0 + punpcklwd xm13, xm0 + vinserti128 m13, m13, xm1, 1 + pmaddwd m13, m13, [r5 + 1 * mmsize] + paddd m11, m13 +%ifnidn %1,ss + paddd m6, m14 + paddd m7, m14 + paddd m8, m14 + paddd m9, m14 + paddd m10, m14 + paddd m11, m14 +%endif + psrad m6, %3 + psrad m7, %3 + psrad m8, %3 + psrad m9, %3 + psrad m10, %3 + psrad m11, %3 + packssdw m6, m7 + packssdw m8, m9 + packssdw m10, m11 + vpermq m6, m6, q3120 + vpermq m8, m8, q3120 + vpermq m10, m10, q3120 +%if %2 + CLIPW m6, m5, m3 + CLIPW m8, m5, m3 + CLIPW m10, m5, m3 +%endif + vextracti128 xm7, m6, 1 + vextracti128 xm9, m8, 1 + vextracti128 xm11, m10, 1 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm7 + lea r2, [r2 + r3 * 4] + movu [r2], xm8 + movu [r2 + r3], xm9 + movu [r2 + r3 * 2], xm10 + movu [r2 + r6], xm11 + RET +%endif +%endmacro + +FILTER_VER_CHROMA_AVX2_8x12 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x12 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x12 sp, 1, INTERP_SHIFT_SP +FILTER_VER_CHROMA_AVX2_8x12 ss, 0, 6
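Note on the `interp_4tap_vert_pp_*` routines above: a minimal scalar sketch in C of what the `pp` variants appear to compute, reading the assembly as "start one row above the block, multiply four vertically adjacent rows by the selected 4-tap coefficients, add the rounding offset 32 (`pd_32`), shift right by 6, clip to `[0, pw_pixel_max]`". The function name `interp_4tap_vert_pp_ref`, the `width`/`height`/`pixelMax` parameters, and strides measured in pixels are assumptions made for illustration; they are not part of x265. The coefficient values are the standard HEVC chroma taps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t pixel; /* assumption: high-bit-depth build, 16-bit samples */

/* HEVC 4-tap chroma interpolation coefficients, one row per fractional position. */
static const int16_t chroma_coeff[8][4] = {
    {  0, 64,  0,  0 },
    { -2, 58, 10, -2 },
    { -4, 54, 16, -2 },
    { -6, 46, 28, -4 },
    { -4, 36, 36, -4 },
    { -4, 28, 46, -6 },
    { -2, 16, 54, -4 },
    { -2, 10, 58, -2 },
};

/* Hypothetical scalar reference for the pixel-to-pixel ("pp") vertical filter.
 * Strides are in pixels here; the assembly doubles them to get byte strides. */
static void interp_4tap_vert_pp_ref(const pixel *src, ptrdiff_t srcStride,
                                    pixel *dst, ptrdiff_t dstStride,
                                    int width, int height,
                                    int coeffIdx, int pixelMax)
{
    assert(coeffIdx >= 0 && coeffIdx < 8);
    const int16_t *c = chroma_coeff[coeffIdx];

    src -= srcStride; /* the filter window starts one row above the output row */

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = c[0] * src[x]
                    + c[1] * src[x + srcStride]
                    + c[2] * src[x + 2 * srcStride]
                    + c[3] * src[x + 3 * srcStride];

            /* Round with offset 32, shift by 6, then clip. Negative sums are
             * clamped before shifting so the result does not depend on how the
             * compiler shifts negative integers. */
            int val = sum + 32;
            val = val < 0 ? 0 : val >> 6;
            if (val > pixelMax)
                val = pixelMax;

            dst[x] = (pixel)val;
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

The `ps`, `sp` and `ss` variants in the assembly use different offsets and shifts (`INTERP_OFFSET_*`, `INTERP_SHIFT_*`) and skip the clip when the second macro argument is 0; this sketch does not model those.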
x265_2.7.tar.gz/source/common/x86/ipfilter8.asm -> x265_2.6.tar.gz/source/common/x86/ipfilter8.asm
Changed
@@ -33,16 +33,119 @@ const interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15 +const interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9 + times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13 + const interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4 dd 2, 3, 3, 4, 4, 5, 5, 6 +const pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 + times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10 + times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12 + times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14 + const tab_Lm, db 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8 db 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10 db 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12 db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14 +const tab_Vm, db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1 + db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3 + +const tab_Cm, db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3 + const pd_526336, times 8 dd 8192*64+2048 +const tab_ChromaCoeff, db 0, 64, 0, 0 + db -2, 58, 10, -2 + db -4, 54, 16, -2 + db -6, 46, 28, -4 + db -4, 36, 36, -4 + db -4, 28, 46, -6 + db -2, 16, 54, -4 + db -2, 10, 58, -2 + +const tabw_ChromaCoeff, dw 0, 64, 0, 0 + dw -2, 58, 10, -2 + dw -4, 54, 16, -2 + dw -6, 46, 28, -4 + dw -4, 36, 36, -4 + dw -4, 28, 46, -6 + dw -2, 16, 54, -4 + dw -2, 10, 58, -2 + +const tab_ChromaCoeff_V, times 8 db 0, 64 + times 8 db 0, 0 + + times 8 db -2, 58 + times 8 db 10, -2 + + times 8 db -4, 54 + times 8 db 16, -2 + + times 8 db -6, 46 + times 8 db 28, -4 + + times 8 db -4, 36 + times 8 db 36, -4 + + times 8 db -4, 28 + times 8 db 46, -6 + + times 8 db -2, 16 + times 8 db 54, -4 + + times 8 db -2, 10 + times 8 db 58, -2 + +const tab_ChromaCoeffV, times 4 dw 0, 64 + times 4 dw 0, 0 + + times 4 dw -2, 58 + times 4 dw 10, -2 + + times 4 dw -4, 54 + times 4 dw 16, -2 + + times 4 dw -6, 46 + times 4 dw 28, -4 + + times 4 dw -4, 36 + times 4 dw 36, 
-4 + + times 4 dw -4, 28 + times 4 dw 46, -6 + + times 4 dw -2, 16 + times 4 dw 54, -4 + + times 4 dw -2, 10 + times 4 dw 58, -2 + +const pw_ChromaCoeffV, times 8 dw 0, 64 + times 8 dw 0, 0 + + times 8 dw -2, 58 + times 8 dw 10, -2 + + times 8 dw -4, 54 + times 8 dw 16, -2 + + times 8 dw -6, 46 + times 8 dw 28, -4 + + times 8 dw -4, 36 + times 8 dw 36, -4 + + times 8 dw -4, 28 + times 8 dw 46, -6 + + times 8 dw -2, 16 + times 8 dw 54, -4 + + times 8 dw -2, 10 + times 8 dw 58, -2 + const tab_LumaCoeff, db 0, 0, 0, 64, 0, 0, 0, 0 db -1, 4, -10, 58, 17, -5, 1, 0 db -1, 4, -11, 40, 40, -11, 4, -1 @@ -93,6 +196,26 @@ times 8 dw 58, -10 times 8 dw 4, -1 +const pb_LumaCoeffVer, times 16 db 0, 0 + times 16 db 0, 64 + times 16 db 0, 0 + times 16 db 0, 0 + + times 16 db -1, 4 + times 16 db -10, 58 + times 16 db 17, -5 + times 16 db 1, 0 + + times 16 db -1, 4 + times 16 db -11, 40 + times 16 db 40, -11 + times 16 db 4, -1 + + times 16 db 0, 1 + times 16 db -5, 17 + times 16 db 58, -10 + times 16 db 4, -1 + const tab_LumaCoeffVer, times 8 db 0, 0 times 8 db 0, 64 times 8 db 0, 0 @@ -133,10 +256,44 @@ times 16 db 58, -10 times 16 db 4, -1 +const tab_ChromaCoeffVer_32, times 16 db 0, 64 + times 16 db 0, 0 + + times 16 db -2, 58 + times 16 db 10, -2 + + times 16 db -4, 54 + times 16 db 16, -2 + + times 16 db -6, 46 + times 16 db 28, -4 + + times 16 db -4, 36 + times 16 db 36, -4 + + times 16 db -4, 28 + times 16 db 46, -6 + + times 16 db -2, 16 + times 16 db 54, -4 + + times 16 db -2, 10 + times 16 db 58, -2 + const tab_c_64_n64, times 8 db 64, -64 +const interp4_shuf, times 2 db 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15 + +const interp4_horiz_shuf1, db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 + db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 + +const interp4_hpp_shuf, times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 + const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 +ALIGN 32 +interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 
9, 10, 11, 9, 10, 11, 12 + SECTION .text cextern pb_128 @@ -146,6 +303,462 @@ cextern pw_2000 cextern pw_8192 +%macro FILTER_H4_w2_2_sse2 0 + pxor m3, m3 + movd m0, [srcq - 1] + movd m2, [srcq] + punpckldq m0, m2 + punpcklbw m0, m3 + movd m1, [srcq + srcstrideq - 1] + movd m2, [srcq + srcstrideq] + punpckldq m1, m2 + punpcklbw m1, m3 + pmaddwd m0, m4 + pmaddwd m1, m4 + packssdw m0, m1 + pshuflw m1, m0, q2301 + pshufhw m1, m1, q2301 + paddw m0, m1 + psrld m0, 16 + packssdw m0, m0 + paddw m0, m5 + psraw m0, 6 + packuswb m0, m0 + movd r4, m0 + mov [dstq], r4w + shr r4, 16 + mov [dstq + dststrideq], r4w +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_2xN(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_H4_W2xN_sse3 1 +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_2x%1, 4, 6, 6, src, srcstride, dst, dststride + mov r4d, r4m + mova m5, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep %1/2 + FILTER_H4_w2_2_sse2 +%if x < %1/2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endif +%assign x x+1 +%endrep + + RET + +%endmacro + + FILTER_H4_W2xN_sse3 4 + FILTER_H4_W2xN_sse3 8 + FILTER_H4_W2xN_sse3 16 + +%macro FILTER_H4_w4_2_sse2 0 + pxor m5, m5 + movd m0, [srcq - 1] + movd m6, [srcq] + punpckldq m0, m6 + punpcklbw m0, m5 + movd m1, [srcq + 1] + movd m6, [srcq + 2] + punpckldq m1, m6 + punpcklbw m1, m5 + movd m2, [srcq + srcstrideq - 1] + movd m6, [srcq + srcstrideq] + punpckldq m2, m6 + punpcklbw m2, m5 + movd m3, [srcq + srcstrideq + 1] + movd m6, [srcq + srcstrideq + 2] + punpckldq m3, m6 + punpcklbw m3, m5 + pmaddwd m0, m4 + pmaddwd m1, m4 + pmaddwd m2, m4 + pmaddwd m3, m4 + packssdw m0, m1 + packssdw m2, m3 + pshuflw m1, m0, q2301 + pshufhw m1, m1, 
q2301 + pshuflw m3, m2, q2301 + pshufhw m3, m3, q2301 + paddw m0, m1 + paddw m2, m3 + psrld m0, 16 + psrld m2, 16 + packssdw m0, m2 + paddw m0, m7 + psraw m0, 6 + packuswb m0, m2 + movd [dstq], m0 + psrldq m0, 4 + movd [dstq + dststrideq], m0 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_H4_W4xN_sse3 1 +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_4x%1, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep %1/2 + FILTER_H4_w4_2_sse2 +%if x < %1/2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endif +%assign x x+1 +%endrep + + RET + +%endmacro + + FILTER_H4_W4xN_sse3 2 + FILTER_H4_W4xN_sse3 4 + FILTER_H4_W4xN_sse3 8 + FILTER_H4_W4xN_sse3 16 + FILTER_H4_W4xN_sse3 32 + +%macro FILTER_H4_w6_sse2 0 + pxor m4, m4 + movh m0, [srcq - 1] + movh m5, [srcq] + punpckldq m0, m5 + movhlps m2, m0 + punpcklbw m0, m4 + punpcklbw m2, m4 + movd m1, [srcq + 1] + movd m5, [srcq + 2] + punpckldq m1, m5 + punpcklbw m1, m4 + pmaddwd m0, m6 + pmaddwd m1, m6 + pmaddwd m2, m6 + packssdw m0, m1 + packssdw m2, m2 + pshuflw m1, m0, q2301 + pshufhw m1, m1, q2301 + pshuflw m3, m2, q2301 + paddw m0, m1 + paddw m2, m3 + psrld m0, 16 + psrld m2, 16 + packssdw m0, m2 + paddw m0, m7 + psraw m0, 6 + packuswb m0, m0 + movd [dstq], m0 + pextrw r4d, m0, 2 + mov [dstq + 4], r4w +%endmacro + +%macro FILH4W8_sse2 1 + movh m0, [srcq - 1 + %1] + movh m5, [srcq + %1] + punpckldq m0, m5 + movhlps m2, m0 + punpcklbw m0, m4 + punpcklbw m2, m4 + movh m1, [srcq + 1 + %1] + movh m5, [srcq + 2 + %1] + punpckldq m1, m5 + movhlps m3, m1 + punpcklbw m1, m4 + punpcklbw m3, m4 + 
pmaddwd m0, m6 + pmaddwd m1, m6 + pmaddwd m2, m6 + pmaddwd m3, m6 + packssdw m0, m1 + packssdw m2, m3 + pshuflw m1, m0, q2301 + pshufhw m1, m1, q2301 + pshuflw m3, m2, q2301 + pshufhw m3, m3, q2301 + paddw m0, m1 + paddw m2, m3 + psrld m0, 16 + psrld m2, 16 + packssdw m0, m2 + paddw m0, m7 + psraw m0, 6 + packuswb m0, m0 + movh [dstq + %1], m0 +%endmacro + +%macro FILTER_H4_w8_sse2 0 + FILH4W8_sse2 0 +%endmacro + +%macro FILTER_H4_w12_sse2 0 + FILH4W8_sse2 0 + movd m1, [srcq - 1 + 8] + movd m3, [srcq + 8] + punpckldq m1, m3 + punpcklbw m1, m4 + movd m2, [srcq + 1 + 8] + movd m3, [srcq + 2 + 8] + punpckldq m2, m3 + punpcklbw m2, m4 + pmaddwd m1, m6 + pmaddwd m2, m6 + packssdw m1, m2 + pshuflw m2, m1, q2301 + pshufhw m2, m2, q2301 + paddw m1, m2 + psrld m1, 16 + packssdw m1, m1 + paddw m1, m7 + psraw m1, 6 + packuswb m1, m1 + movd [dstq + 8], m1 +%endmacro + +%macro FILTER_H4_w16_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 +%endmacro + +%macro FILTER_H4_w24_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 + FILH4W8_sse2 16 +%endmacro + +%macro FILTER_H4_w32_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 + FILH4W8_sse2 16 + FILH4W8_sse2 24 +%endmacro + +%macro FILTER_H4_w48_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 + FILH4W8_sse2 16 + FILH4W8_sse2 24 + FILH4W8_sse2 32 + FILH4W8_sse2 40 +%endmacro + +%macro FILTER_H4_w64_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 + FILH4W8_sse2 16 + FILH4W8_sse2 24 + FILH4W8_sse2 32 + FILH4W8_sse2 40 + FILH4W8_sse2 48 + FILH4W8_sse2 56 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_sse3 2 +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + pxor m4, m4 + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m6, [r5 + r4 * 8] 
+%else + movddup m6, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep %2 + FILTER_H4_w%1_sse2 +%if x < %2 + add srcq, srcstrideq + add dstq, dststrideq +%endif +%assign x x+1 +%endrep + + RET + +%endmacro + + IPFILTER_CHROMA_sse3 6, 8 + IPFILTER_CHROMA_sse3 8, 2 + IPFILTER_CHROMA_sse3 8, 4 + IPFILTER_CHROMA_sse3 8, 6 + IPFILTER_CHROMA_sse3 8, 8 + IPFILTER_CHROMA_sse3 8, 16 + IPFILTER_CHROMA_sse3 8, 32 + IPFILTER_CHROMA_sse3 12, 16 + + IPFILTER_CHROMA_sse3 6, 16 + IPFILTER_CHROMA_sse3 8, 12 + IPFILTER_CHROMA_sse3 8, 64 + IPFILTER_CHROMA_sse3 12, 32 + + IPFILTER_CHROMA_sse3 16, 4 + IPFILTER_CHROMA_sse3 16, 8 + IPFILTER_CHROMA_sse3 16, 12 + IPFILTER_CHROMA_sse3 16, 16 + IPFILTER_CHROMA_sse3 16, 32 + IPFILTER_CHROMA_sse3 32, 8 + IPFILTER_CHROMA_sse3 32, 16 + IPFILTER_CHROMA_sse3 32, 24 + IPFILTER_CHROMA_sse3 24, 32 + IPFILTER_CHROMA_sse3 32, 32 + + IPFILTER_CHROMA_sse3 16, 24 + IPFILTER_CHROMA_sse3 16, 64 + IPFILTER_CHROMA_sse3 32, 48 + IPFILTER_CHROMA_sse3 24, 64 + IPFILTER_CHROMA_sse3 32, 64 + + IPFILTER_CHROMA_sse3 64, 64 + IPFILTER_CHROMA_sse3 64, 32 + IPFILTER_CHROMA_sse3 64, 48 + IPFILTER_CHROMA_sse3 48, 64 + IPFILTER_CHROMA_sse3 64, 16 + +%macro FILTER_2 2 + movd m3, [srcq + %1] + movd m4, [srcq + 1 + %1] + punpckldq m3, m4 + punpcklbw m3, m0 + pmaddwd m3, m1 + packssdw m3, m3 + pshuflw m4, m3, q2301 + paddw m3, m4 + psrldq m3, 2 + psubw m3, m2 + movd [dstq + %2], m3 +%endmacro + +%macro FILTER_4 2 + movd m3, [srcq + %1] + movd m4, [srcq + 1 + %1] + punpckldq m3, m4 + punpcklbw m3, m0 + pmaddwd m3, m1 + movd m4, [srcq + 2 + %1] + movd m5, [srcq + 3 + %1] + punpckldq m4, m5 + punpcklbw m4, m0 + pmaddwd m4, m1 + packssdw m3, m4 + pshuflw m4, m3, q2301 + pshufhw m4, m4, q2301 + paddw m3, m4 + psrldq m3, 2 + pshufd m3, m3, q3120 + psubw m3, m2 + movh [dstq + %2], m3 +%endmacro + +%macro FILTER_4TAP_HPS_sse3 2 +INIT_XMM sse3 +cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride + mov r4d, r4m + add dststrided, dststrided + mova m2, 
[pw_2000] + pxor m0, m0 + +%ifdef PIC + lea r6, [tabw_ChromaCoeff] + movddup m1, [r6 + r4 * 8] +%else + movddup m1, [tabw_ChromaCoeff + r4 * 8] +%endif + + mov r4d, %2 + cmp r5m, byte 0 + je .loopH + sub srcq, srcstrideq + add r4d, 3 + +.loopH: +%assign x -1 +%assign y 0 +%rep %1/4 + FILTER_4 x,y +%assign x x+4 +%assign y y+8 +%endrep +%rep (%1 % 4)/2 + FILTER_2 x,y +%endrep + add srcq, srcstrideq + add dstq, dststrideq + + dec r4d + jnz .loopH + RET + +%endmacro + + FILTER_4TAP_HPS_sse3 2, 4 + FILTER_4TAP_HPS_sse3 2, 8 + FILTER_4TAP_HPS_sse3 2, 16 + FILTER_4TAP_HPS_sse3 4, 2 + FILTER_4TAP_HPS_sse3 4, 4 + FILTER_4TAP_HPS_sse3 4, 8 + FILTER_4TAP_HPS_sse3 4, 16 + FILTER_4TAP_HPS_sse3 4, 32 + FILTER_4TAP_HPS_sse3 6, 8 + FILTER_4TAP_HPS_sse3 6, 16 + FILTER_4TAP_HPS_sse3 8, 2 + FILTER_4TAP_HPS_sse3 8, 4 + FILTER_4TAP_HPS_sse3 8, 6 + FILTER_4TAP_HPS_sse3 8, 8 + FILTER_4TAP_HPS_sse3 8, 12 + FILTER_4TAP_HPS_sse3 8, 16 + FILTER_4TAP_HPS_sse3 8, 32 + FILTER_4TAP_HPS_sse3 8, 64 + FILTER_4TAP_HPS_sse3 12, 16 + FILTER_4TAP_HPS_sse3 12, 32 + FILTER_4TAP_HPS_sse3 16, 4 + FILTER_4TAP_HPS_sse3 16, 8 + FILTER_4TAP_HPS_sse3 16, 12 + FILTER_4TAP_HPS_sse3 16, 16 + FILTER_4TAP_HPS_sse3 16, 24 + FILTER_4TAP_HPS_sse3 16, 32 + FILTER_4TAP_HPS_sse3 16, 64 + FILTER_4TAP_HPS_sse3 24, 32 + FILTER_4TAP_HPS_sse3 24, 64 + FILTER_4TAP_HPS_sse3 32, 8 + FILTER_4TAP_HPS_sse3 32, 16 + FILTER_4TAP_HPS_sse3 32, 24 + FILTER_4TAP_HPS_sse3 32, 32 + FILTER_4TAP_HPS_sse3 32, 48 + FILTER_4TAP_HPS_sse3 32, 64 + FILTER_4TAP_HPS_sse3 48, 64 + FILTER_4TAP_HPS_sse3 64, 16 + FILTER_4TAP_HPS_sse3 64, 32 + FILTER_4TAP_HPS_sse3 64, 48 + FILTER_4TAP_HPS_sse3 64, 64 + %macro FILTER_H8_W8_sse2 0 movh m1, [r0 + x - 3] movh m4, [r0 + x - 2] @@ -242,6 +855,137 @@ psrldq m1, 2 pshufd m1, m1, q3120 %endmacro + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t 
dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_LUMA_sse2 3 +INIT_XMM sse2 +cglobal interp_8tap_horiz_%3_%1x%2, 4,6,8 + mov r4d, r4m + add r4d, r4d + pxor m6, m6 + +%ifidn %3, ps + add r3d, r3d + cmp r5m, byte 0 +%endif + +%ifdef PIC + lea r5, [tabw_LumaCoeff] + movu m3, [r5 + r4 * 8] +%else + movu m3, [tabw_LumaCoeff + r4 * 8] +%endif + + mov r4d, %2 + +%ifidn %3, pp + mova m2, [pw_32] +%else + mova m2, [pw_2000] + je .loopH + lea r5, [r1 + 2 * r1] + sub r0, r5 + add r4d, 7 +%endif + +.loopH: +%assign x 0 +%rep %1 / 8 + FILTER_H8_W8_sse2 + %ifidn %3, pp + paddw m1, m2 + psraw m1, 6 + packuswb m1, m1 + movh [r2 + x], m1 + %else + psubw m1, m2 + movu [r2 + 2 * x], m1 + %endif +%assign x x+8 +%endrep + +%rep (%1 % 8) / 4 + FILTER_H8_W4_sse2 + %ifidn %3, pp + paddw m1, m2 + psraw m1, 6 + packuswb m1, m1 + movd [r2 + x], m1 + %else + psubw m1, m2 + movh [r2 + 2 * x], m1 + %endif +%endrep + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET + +%endmacro + +;-------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- + IPFILTER_LUMA_sse2 4, 4, pp + IPFILTER_LUMA_sse2 4, 8, pp + IPFILTER_LUMA_sse2 8, 4, pp + IPFILTER_LUMA_sse2 8, 8, pp + IPFILTER_LUMA_sse2 16, 16, pp + IPFILTER_LUMA_sse2 16, 8, pp + IPFILTER_LUMA_sse2 8, 16, pp + IPFILTER_LUMA_sse2 16, 12, pp + IPFILTER_LUMA_sse2 12, 16, pp + IPFILTER_LUMA_sse2 16, 4, pp + IPFILTER_LUMA_sse2 4, 16, pp + IPFILTER_LUMA_sse2 32, 32, pp + IPFILTER_LUMA_sse2 32, 16, pp + IPFILTER_LUMA_sse2 16, 32, pp + IPFILTER_LUMA_sse2 32, 24, pp + IPFILTER_LUMA_sse2 24, 32, pp + IPFILTER_LUMA_sse2 32, 8, pp + IPFILTER_LUMA_sse2 8, 
32, pp + IPFILTER_LUMA_sse2 64, 64, pp + IPFILTER_LUMA_sse2 64, 32, pp + IPFILTER_LUMA_sse2 32, 64, pp + IPFILTER_LUMA_sse2 64, 48, pp + IPFILTER_LUMA_sse2 48, 64, pp + IPFILTER_LUMA_sse2 64, 16, pp + IPFILTER_LUMA_sse2 16, 64, pp + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- + IPFILTER_LUMA_sse2 4, 4, ps + IPFILTER_LUMA_sse2 8, 8, ps + IPFILTER_LUMA_sse2 8, 4, ps + IPFILTER_LUMA_sse2 4, 8, ps + IPFILTER_LUMA_sse2 16, 16, ps + IPFILTER_LUMA_sse2 16, 8, ps + IPFILTER_LUMA_sse2 8, 16, ps + IPFILTER_LUMA_sse2 16, 12, ps + IPFILTER_LUMA_sse2 12, 16, ps + IPFILTER_LUMA_sse2 16, 4, ps + IPFILTER_LUMA_sse2 4, 16, ps + IPFILTER_LUMA_sse2 32, 32, ps + IPFILTER_LUMA_sse2 32, 16, ps + IPFILTER_LUMA_sse2 16, 32, ps + IPFILTER_LUMA_sse2 32, 24, ps + IPFILTER_LUMA_sse2 24, 32, ps + IPFILTER_LUMA_sse2 32, 8, ps + IPFILTER_LUMA_sse2 8, 32, ps + IPFILTER_LUMA_sse2 64, 64, ps + IPFILTER_LUMA_sse2 64, 32, ps + IPFILTER_LUMA_sse2 32, 64, ps + IPFILTER_LUMA_sse2 64, 48, ps + IPFILTER_LUMA_sse2 48, 64, ps + IPFILTER_LUMA_sse2 64, 16, ps + IPFILTER_LUMA_sse2 16, 64, ps + %macro PROCESS_LUMA_W4_4R_sse2 0 movd m2, [r0] movd m7, [r0 + r1] @@ -601,6 +1345,1945 @@ FILTER_VER_LUMA_sse2 64, 64, ps %endif +%macro WORD_TO_DOUBLE 1 +%if ARCH_X86_64 + punpcklbw %1, m8 +%else + punpcklbw %1, %1 + psrlw %1, 8 +%endif +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_2x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W2_H4_sse2 2 +INIT_XMM sse2 +%if ARCH_X86_64 +cglobal 
interp_4tap_vert_%1_2x%2, 4, 6, 9 + pxor m8, m8 +%else +cglobal interp_4tap_vert_%1_2x%2, 4, 6, 8 +%endif + mov r4d, r4m + sub r0, r1 + +%ifidn %1,pp + mova m1, [pw_32] +%elifidn %1,ps + mova m1, [pw_2000] + add r3d, r3d +%endif + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movh m0, [r5 + r4 * 8] +%else + movh m0, [tabw_ChromaCoeff + r4 * 8] +%endif + + punpcklqdq m0, m0 + lea r5, [3 * r1] + +%assign x 1 +%rep %2/4 + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklwd m2, m6 + + WORD_TO_DOUBLE m2 + pmaddwd m2, m0 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklwd m3, m7 + + WORD_TO_DOUBLE m3 + pmaddwd m3, m0 + + packssdw m2, m3 + pshuflw m3, m2, q2301 + pshufhw m3, m3, q2301 + paddw m2, m3 + + movd m7, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklwd m4, m3 + + WORD_TO_DOUBLE m4 + pmaddwd m4, m0 + + movd m3, [r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklwd m5, m7 + + WORD_TO_DOUBLE m5 + pmaddwd m5, m0 + + packssdw m4, m5 + pshuflw m5, m4, q2301 + pshufhw m5, m5, q2301 + paddw m4, m5 + +%ifidn %1,pp + psrld m2, 16 + psrld m4, 16 + packssdw m2, m4 + paddw m2, m1 + psraw m2, 6 + packuswb m2, m2 + +%if ARCH_X86_64 + movq r4, m2 + mov [r2], r4w + shr r4, 16 + mov [r2 + r3], r4w + lea r2, [r2 + 2 * r3] + shr r4, 16 + mov [r2], r4w + shr r4, 16 + mov [r2 + r3], r4w +%else + movd r4, m2 + mov [r2], r4w + shr r4, 16 + mov [r2 + r3], r4w + lea r2, [r2 + 2 * r3] + psrldq m2, 4 + movd r4, m2 + mov [r2], r4w + shr r4, 16 + mov [r2 + r3], r4w +%endif +%elifidn %1,ps + psrldq m2, 2 + psrldq m4, 2 + pshufd m2, m2, q3120 + pshufd m4, m4, q3120 + psubw m4, m1 + psubw m2, m1 + + movd [r2], m2 + psrldq m2, 4 + movd [r2 + r3], m2 + lea r2, [r2 + 2 * r3] + movd [r2], m4 + psrldq m4, 4 + movd [r2 + r3], m4 +%endif + +%if x < %2/4 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET + +%endmacro + + 
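Editorial note, not part of the patch: for readers checking the SSE2 macro above against a scalar model, the following is a minimal Python sketch of what a 4-tap vertical chroma filter computes in its `pp` and `ps` forms. The coefficient table is the standard HEVC chroma interpolation filter. The function names, the flat-list pixel layout, and the reading of `pw_2000` as the 16-bit constant 0x2000 are assumptions made for illustration. The rounding steps (`pw_32`, `psraw ..., 6`, `packuswb`) and the one-row step back (`sub r0, r1`) are taken from the macro itself.

```python
# Scalar reference sketch for a 4-tap vertical chroma filter on 8-bit pixels.
# Not from the patch: all names and the data layout are illustrative assumptions.

# Standard HEVC chroma interpolation taps, indexed by coeffIdx (0..7).
CHROMA_COEFF = [
    (0, 64, 0, 0),
    (-2, 58, 10, -2),
    (-4, 54, 16, -2),
    (-6, 46, 28, -4),
    (-4, 36, 36, -4),
    (-4, 28, 46, -6),
    (-2, 16, 54, -4),
    (-2, 10, 58, -2),
]


def _tap_sum(src, pos, stride, coeff):
    """Weighted sum over rows y-1 .. y+2, mirroring the initial `sub r0, r1`."""
    return sum(c * src[pos + (k - 1) * stride] for k, c in enumerate(coeff))


def interp_4tap_vert_pp(src, src_off, src_stride, width, height, coeff_idx):
    """Pixel to pixel: add 32, arithmetic shift right by 6, clamp to 0..255."""
    coeff = CHROMA_COEFF[coeff_idx]
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            s = _tap_sum(src, src_off + y * src_stride + x, src_stride, coeff)
            row.append(min(255, max(0, (s + 32) >> 6)))
        out.append(row)
    return out


def interp_4tap_vert_ps(src, src_off, src_stride, width, height, coeff_idx):
    """Pixel to short: subtract 0x2000 (assumed value of pw_2000), no shift or clamp."""
    coeff = CHROMA_COEFF[coeff_idx]
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            s = _tap_sum(src, src_off + y * src_stride + x, src_stride, coeff)
            row.append(s - 0x2000)
        out.append(row)
    return out
```

The caller must leave one valid row above and two below the block, which is why `src_off` in this sketch points at the second row of the buffer rather than the first.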
FILTER_V4_W2_H4_sse2 pp, 4 + FILTER_V4_W2_H4_sse2 pp, 8 + FILTER_V4_W2_H4_sse2 pp, 16 + + FILTER_V4_W2_H4_sse2 ps, 4 + FILTER_V4_W2_H4_sse2 ps, 8 + FILTER_V4_W2_H4_sse2 ps, 16 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V2_W4_H4_sse2 1 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_4x2, 4, 6, 8 + mov r4d, r4m + sub r0, r1 + pxor m7, m7 + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movh m0, [r5 + r4 * 8] +%else + movh m0, [tabw_ChromaCoeff + r4 * 8] +%endif + + lea r5, [r0 + 2 * r1] + punpcklqdq m0, m0 + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r5] + movd m5, [r5 + r1] + + punpcklbw m2, m3 + punpcklbw m1, m4, m5 + punpcklwd m2, m1 + + movhlps m6, m2 + punpcklbw m2, m7 + punpcklbw m6, m7 + pmaddwd m2, m0 + pmaddwd m6, m0 + packssdw m2, m6 + + movd m1, [r0 + 4 * r1] + + punpcklbw m3, m4 + punpcklbw m5, m1 + punpcklwd m3, m5 + + movhlps m6, m3 + punpcklbw m3, m7 + punpcklbw m6, m7 + pmaddwd m3, m0 + pmaddwd m6, m0 + packssdw m3, m6 + + pshuflw m4, m2, q2301 + pshufhw m4, m4, q2301 + paddw m2, m4 + pshuflw m5, m3, q2301 + pshufhw m5, m5, q2301 + paddw m3, m5 + +%ifidn %1, pp + psrld m2, 16 + psrld m3, 16 + packssdw m2, m3 + + paddw m2, [pw_32] + psraw m2, 6 + packuswb m2, m2 + + movd [r2], m2 + psrldq m2, 4 + movd [r2 + r3], m2 +%elifidn %1, ps + psrldq m2, 2 + psrldq m3, 2 + pshufd m2, m2, q3120 + pshufd m3, m3, q3120 + punpcklqdq m2, m3 + + add r3d, r3d + psubw m2, [pw_2000] + movh [r2], m2 + movhps [r2 + r3], m2 +%endif + RET + +%endmacro + + FILTER_V2_W4_H4_sse2 pp + FILTER_V2_W4_H4_sse2 ps + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
+;----------------------------------------------------------------------------- +%macro FILTER_V4_W4_H4_sse2 2 +INIT_XMM sse2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_%1_4x%2, 4, 6, 9 + pxor m8, m8 +%else +cglobal interp_4tap_vert_%1_4x%2, 4, 6, 8 +%endif + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movh m0, [r5 + r4 * 8] +%else + movh m0, [tabw_ChromaCoeff + r4 * 8] +%endif + +%ifidn %1,pp + mova m1, [pw_32] +%elifidn %1,ps + add r3d, r3d + mova m1, [pw_2000] +%endif + + lea r5, [3 * r1] + lea r4, [3 * r3] + punpcklqdq m0, m0 + +%assign x 1 +%rep %2/4 + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklwd m2, m6 + + movhlps m6, m2 + WORD_TO_DOUBLE m2 + WORD_TO_DOUBLE m6 + pmaddwd m2, m0 + pmaddwd m6, m0 + packssdw m2, m6 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklwd m3, m7 + + movhlps m7, m3 + WORD_TO_DOUBLE m3 + WORD_TO_DOUBLE m7 + pmaddwd m3, m0 + pmaddwd m7, m0 + packssdw m3, m7 + + pshuflw m7, m2, q2301 + pshufhw m7, m7, q2301 + paddw m2, m7 + pshuflw m7, m3, q2301 + pshufhw m7, m7, q2301 + paddw m3, m7 + +%ifidn %1,pp + psrld m2, 16 + psrld m3, 16 + packssdw m2, m3 + paddw m2, m1 + psraw m2, 6 +%elifidn %1,ps + psrldq m2, 2 + psrldq m3, 2 + pshufd m2, m2, q3120 + pshufd m3, m3, q3120 + punpcklqdq m2, m3 + + psubw m2, m1 + movh [r2], m2 + movhps [r2 + r3], m2 +%endif + + movd m7, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklwd m4, m3 + + movhlps m3, m4 + WORD_TO_DOUBLE m4 + WORD_TO_DOUBLE m3 + pmaddwd m4, m0 + pmaddwd m3, m0 + packssdw m4, m3 + + movd m3, [r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklwd m5, m7 + + movhlps m3, m5 + WORD_TO_DOUBLE m5 + WORD_TO_DOUBLE m3 + pmaddwd m5, m0 + pmaddwd m3, m0 + packssdw m5, m3 + + pshuflw m7, m4, q2301 + pshufhw m7, m7, q2301 + paddw m4, m7 + pshuflw m7, m5, q2301 + pshufhw m7, m7, q2301 + paddw m5, m7 + +%ifidn %1,pp 
+ psrld m4, 16 + psrld m5, 16 + packssdw m4, m5 + + paddw m4, m1 + psraw m4, 6 + packuswb m2, m4 + + movd [r2], m2 + psrldq m2, 4 + movd [r2 + r3], m2 + psrldq m2, 4 + movd [r2 + 2 * r3], m2 + psrldq m2, 4 + movd [r2 + r4], m2 +%elifidn %1,ps + psrldq m4, 2 + psrldq m5, 2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + psubw m4, m1 + movh [r2 + 2 * r3], m4 + movhps [r2 + r4], m4 +%endif + +%if x < %2/4 + lea r2, [r2 + 4 * r3] +%endif + +%assign x x+1 +%endrep + RET + +%endmacro + + FILTER_V4_W4_H4_sse2 pp, 4 + FILTER_V4_W4_H4_sse2 pp, 8 + FILTER_V4_W4_H4_sse2 pp, 16 + FILTER_V4_W4_H4_sse2 pp, 32 + + FILTER_V4_W4_H4_sse2 ps, 4 + FILTER_V4_W4_H4_sse2 ps, 8 + FILTER_V4_W4_H4_sse2 ps, 16 + FILTER_V4_W4_H4_sse2 ps, 32 + +;----------------------------------------------------------------------------- +;void interp_4tap_vert_%1_6x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W6_H4_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_6x%2, 4, 7, 10 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m6, [r5 + r4] + mova m5, [r5 + r4 + 16] +%else + mova m6, [tab_ChromaCoeffV + r4] + mova m5, [tab_ChromaCoeffV + r4 + 16] +%endif + +%ifidn %1,pp + mova m4, [pw_32] +%elifidn %1,ps + mova m4, [pw_2000] + add r3d, r3d +%endif + lea r5, [3 * r1] + +%assign x 1 +%rep %2/4 + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + movhlps m7, m0 + punpcklbw m0, m9 + punpcklbw m7, m9 + pmaddwd m0, m6 + pmaddwd m7, m6 + packssdw m0, m7 + + movhlps m8, m2 + movq m7, m2 + punpcklbw m8, m9 + punpcklbw m7, m9 + pmaddwd m8, m5 + pmaddwd m7, m5 + packssdw m7, m8 + + paddw m0, m7 + +%ifidn %1,pp + paddw m0, m4 + psraw m0, 6 + packuswb m0, m0 + + movd [r2], m0 + pextrw r6d, m0, 2 + mov [r2 + 4], r6w 
+%elifidn %1,ps + psubw m0, m4 + movh [r2], m0 + pshufd m0, m0, 2 + movd [r2 + 8], m0 +%endif + + lea r0, [r0 + 4 * r1] + + movq m0, [r0] + punpcklbw m3, m0 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m6 + pmaddwd m8, m6 + packssdw m1, m8 + + movhlps m8, m3 + movq m7, m3 + punpcklbw m8, m9 + punpcklbw m7, m9 + pmaddwd m8, m5 + pmaddwd m7, m5 + packssdw m7, m8 + + paddw m1, m7 + +%ifidn %1,pp + paddw m1, m4 + psraw m1, 6 + packuswb m1, m1 + + movd [r2 + r3], m1 + pextrw r6d, m1, 2 + mov [r2 + r3 + 4], r6w +%elifidn %1,ps + psubw m1, m4 + movh [r2 + r3], m1 + pshufd m1, m1, 2 + movd [r2 + r3 + 8], m1 +%endif + + movq m1, [r0 + r1] + punpcklbw m7, m0, m1 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m6 + pmaddwd m8, m6 + packssdw m2, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m2, m7 + lea r2, [r2 + 2 * r3] + +%ifidn %1,pp + paddw m2, m4 + psraw m2, 6 + packuswb m2, m2 + movd [r2], m2 + pextrw r6d, m2, 2 + mov [r2 + 4], r6w +%elifidn %1,ps + psubw m2, m4 + movh [r2], m2 + pshufd m2, m2, 2 + movd [r2 + 8], m2 +%endif + + movq m2, [r0 + 2 * r1] + punpcklbw m1, m2 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m6 + pmaddwd m8, m6 + packssdw m3, m8 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m5 + pmaddwd m8, m5 + packssdw m1, m8 + + paddw m3, m1 + +%ifidn %1,pp + paddw m3, m4 + psraw m3, 6 + packuswb m3, m3 + + movd [r2 + r3], m3 + pextrw r6d, m3, 2 + mov [r2 + r3 + 4], r6w +%elifidn %1,ps + psubw m3, m4 + movh [r2 + r3], m3 + pshufd m3, m3, 2 + movd [r2 + r3 + 8], m3 +%endif + +%if x < %2/4 + lea r2, [r2 + 2 * r3] +%endif + +%assign x x+1 +%endrep + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W6_H4_sse2 pp, 8 + FILTER_V4_W6_H4_sse2 pp, 16 + FILTER_V4_W6_H4_sse2 ps, 8 + FILTER_V4_W6_H4_sse2 ps, 16 +%endif + +;----------------------------------------------------------------------------- +; 
void interp_4tap_vert_%1_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W8_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_8x%2, 4, 7, 12 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m4, [pw_32] +%elifidn %1,ps + mova m4, [pw_2000] + add r3d, r3d +%endif + +%ifdef PIC + lea r6, [tab_ChromaCoeffV] + mova m6, [r6 + r4] + mova m5, [r6 + r4 + 16] +%else + mova m6, [tab_ChromaCoeffV + r4] + mova m5, [tab_ChromaCoeffV + r4 + 16] +%endif + + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + lea r5, [r0 + 2 * r1] + movq m3, [r5 + r1] + + punpcklbw m0, m1 + punpcklbw m7, m2, m3 + + movhlps m8, m0 + punpcklbw m0, m9 + punpcklbw m8, m9 + pmaddwd m0, m6 + pmaddwd m8, m6 + packssdw m0, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m0, m7 + +%ifidn %1,pp + paddw m0, m4 + psraw m0, 6 +%elifidn %1,ps + psubw m0, m4 + movu [r2], m0 +%endif + + movq m11, [r0 + 4 * r1] + + punpcklbw m1, m2 + punpcklbw m7, m3, m11 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m6 + pmaddwd m8, m6 + packssdw m1, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m1, m7 + +%ifidn %1,pp + paddw m1, m4 + psraw m1, 6 + packuswb m1, m0 + + movhps [r2], m1 + movh [r2 + r3], m1 +%elifidn %1,ps + psubw m1, m4 + movu [r2 + r3], m1 +%endif +%if %2 == 2 ;end of 8x2 + RET + +%else + lea r6, [r0 + 4 * r1] + movq m1, [r6 + r1] + + punpcklbw m2, m3 + punpcklbw m7, m11, m1 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m6 + pmaddwd m8, m6 + packssdw m2, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m2, m7 + +%ifidn %1,pp + paddw m2, m4 + psraw m2, 6 +%elifidn %1,ps + psubw 
m2, m4 + movu [r2 + 2 * r3], m2 +%endif + + movq m10, [r6 + 2 * r1] + + punpcklbw m3, m11 + punpcklbw m7, m1, m10 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m6 + pmaddwd m8, m6 + packssdw m3, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m3, m7 + lea r5, [r2 + 2 * r3] + +%ifidn %1,pp + paddw m3, m4 + psraw m3, 6 + packuswb m3, m2 + + movhps [r2 + 2 * r3], m3 + movh [r5 + r3], m3 +%elifidn %1,ps + psubw m3, m4 + movu [r5 + r3], m3 +%endif +%if %2 == 4 ;end of 8x4 + RET + +%else + lea r6, [r6 + 2 * r1] + movq m3, [r6 + r1] + + punpcklbw m11, m1 + punpcklbw m7, m10, m3 + + movhlps m8, m11 + punpcklbw m11, m9 + punpcklbw m8, m9 + pmaddwd m11, m6 + pmaddwd m8, m6 + packssdw m11, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m11, m7 + +%ifidn %1, pp + paddw m11, m4 + psraw m11, 6 +%elifidn %1,ps + psubw m11, m4 + movu [r2 + 4 * r3], m11 +%endif + + movq m7, [r0 + 8 * r1] + + punpcklbw m1, m10 + punpcklbw m3, m7 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m6 + pmaddwd m8, m6 + packssdw m1, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m5 + pmaddwd m8, m5 + packssdw m3, m8 + + paddw m1, m3 + lea r5, [r2 + 4 * r3] + +%ifidn %1,pp + paddw m1, m4 + psraw m1, 6 + packuswb m1, m11 + + movhps [r2 + 4 * r3], m1 + movh [r5 + r3], m1 +%elifidn %1,ps + psubw m1, m4 + movu [r5 + r3], m1 +%endif +%if %2 == 6 + RET + +%else + %error INVALID macro argument, only 2, 4 or 6! 
+%endif +%endif +%endif +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W8_sse2 pp, 2 + FILTER_V4_W8_sse2 pp, 4 + FILTER_V4_W8_sse2 pp, 6 + FILTER_V4_W8_sse2 ps, 2 + FILTER_V4_W8_sse2 ps, 4 + FILTER_V4_W8_sse2 ps, 6 +%endif + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W8_H8_H16_H32_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_8x%2, 4, 6, 11 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m6, [r5 + r4] + mova m5, [r5 + r4 + 16] +%else + mova m6, [tab_ChromaCoeff + r4] + mova m5, [tab_ChromaCoeff + r4 + 16] +%endif + +%ifidn %1,pp + mova m4, [pw_32] +%elifidn %1,ps + mova m4, [pw_2000] + add r3d, r3d +%endif + + lea r5, [r1 * 3] + +%assign x 1 +%rep %2/4 + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + movhlps m7, m0 + punpcklbw m0, m9 + punpcklbw m7, m9 + pmaddwd m0, m6 + pmaddwd m7, m6 + packssdw m0, m7 + + movhlps m8, m2 + movq m7, m2 + punpcklbw m8, m9 + punpcklbw m7, m9 + pmaddwd m8, m5 + pmaddwd m7, m5 + packssdw m7, m8 + + paddw m0, m7 + +%ifidn %1,pp + paddw m0, m4 + psraw m0, 6 +%elifidn %1,ps + psubw m0, m4 + movu [r2], m0 +%endif + + lea r0, [r0 + 4 * r1] + movq m10, [r0] + punpcklbw m3, m10 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m6 + pmaddwd m8, m6 + packssdw m1, m8 + + movhlps m8, m3 + movq m7, m3 + punpcklbw m8, m9 + punpcklbw m7, m9 + pmaddwd m8, m5 + pmaddwd m7, m5 + packssdw m7, m8 + + paddw m1, m7 + +%ifidn %1,pp + paddw m1, m4 + psraw m1, 6 + + packuswb m0, m1 + movh [r2], m0 + movhps [r2 + r3], m0 +%elifidn %1,ps + psubw m1, m4 + movu [r2 + r3], m1 +%endif + + movq m1, [r0 + r1] + punpcklbw m10, m1 + + movhlps m8, m2 + 
punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m6 + pmaddwd m8, m6 + packssdw m2, m8 + + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m5 + pmaddwd m8, m5 + packssdw m10, m8 + + paddw m2, m10 + lea r2, [r2 + 2 * r3] + +%ifidn %1,pp + paddw m2, m4 + psraw m2, 6 +%elifidn %1,ps + psubw m2, m4 + movu [r2], m2 +%endif + + movq m7, [r0 + 2 * r1] + punpcklbw m1, m7 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m6 + pmaddwd m8, m6 + packssdw m3, m8 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m5 + pmaddwd m8, m5 + packssdw m1, m8 + + paddw m3, m1 + +%ifidn %1,pp + paddw m3, m4 + psraw m3, 6 + + packuswb m2, m3 + movh [r2], m2 + movhps [r2 + r3], m2 +%elifidn %1,ps + psubw m3, m4 + movu [r2 + r3], m3 +%endif + +%if x < %2/4 + lea r2, [r2 + 2 * r3] +%endif +%endrep + RET +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W8_H8_H16_H32_sse2 pp, 8 + FILTER_V4_W8_H8_H16_H32_sse2 pp, 16 + FILTER_V4_W8_H8_H16_H32_sse2 pp, 32 + + FILTER_V4_W8_H8_H16_H32_sse2 pp, 12 + FILTER_V4_W8_H8_H16_H32_sse2 pp, 64 + + FILTER_V4_W8_H8_H16_H32_sse2 ps, 8 + FILTER_V4_W8_H8_H16_H32_sse2 ps, 16 + FILTER_V4_W8_H8_H16_H32_sse2 ps, 32 + + FILTER_V4_W8_H8_H16_H32_sse2 ps, 12 + FILTER_V4_W8_H8_H16_H32_sse2 ps, 64 +%endif + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W12_H2_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_12x%2, 4, 6, 11 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m6, [pw_32] +%elifidn %1,ps + mova m6, [pw_2000] + add r3d, r3d +%endif + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m1, [r5 + r4] + mova m0, [r5 + r4 + 16] +%else + mova m1, [tab_ChromaCoeffV + r4] + mova m0, [tab_ChromaCoeffV + r4 + 16] +%endif + +%assign x 1 +%rep %2/2 + 
movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m7, [r0 + r1] + + punpcklbw m10, m5, m7 + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + + paddw m4, m10 + + punpckhbw m10, m5, m7 + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + + paddw m2, m10 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m2, m6 + psraw m2, 6 + + packuswb m4, m2 + movh [r2], m4 + psrldq m4, 8 + movd [r2 + 8], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m2, m6 + movu [r2], m4 + movh [r2 + 16], m2 +%endif + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m4 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m1 + pmaddwd m8, m1 + packssdw m3, m8 + + movu m5, [r0 + 2 * r1] + punpcklbw m2, m7, m5 + punpckhbw m7, m5 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m0 + pmaddwd m8, m0 + packssdw m2, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + + paddw m4, m2 + paddw m3, m7 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m3, m6 + psraw m3, 6 + + packuswb m4, m3 + movh [r2 + r3], m4 + psrldq m4, 8 + movd [r2 + r3 + 8], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m3, m6 + movu [r2 + r3], m4 + movh [r2 + r3 + 16], m3 +%endif + +%if x < %2/2 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W12_H2_sse2 pp, 16 + FILTER_V4_W12_H2_sse2 pp, 32 + FILTER_V4_W12_H2_sse2 ps, 16 + FILTER_V4_W12_H2_sse2 ps, 32 +%endif + 
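Editorial note, not part of the patch: the wider vertical macros above load two 16-byte vectors per `coeffIdx` from `tab_ChromaCoeffV` (at byte offset `coeffIdx << 5` and 16 bytes after it), apply one `pmaddwd` with each to interleaved row pairs, and add the two halves with `paddw`. One plausible reading, which is an assumption because the table definition is not shown in this hunk, is that the first vector repeats the tap pair (c0, c1) as 16-bit words and the second repeats (c2, c3). The Python sketch below models that split for four adjacent output samples and compares it with the direct 4-tap sum. All names are hypothetical.

```python
# Model of the two-step pmaddwd accumulation in the vertical SSE2 macros.
# Assumed table layout (not shown in this hunk): per coeffIdx, eight words
# [c0, c1] * 4 followed by eight words [c2, c3] * 4, i.e. 32 bytes per entry.

CHROMA_COEFF = [
    (0, 64, 0, 0),
    (-2, 58, 10, -2),
    (-4, 54, 16, -2),
    (-6, 46, 28, -4),
    (-4, 36, 36, -4),
    (-4, 28, 46, -6),
    (-2, 16, 54, -4),
    (-2, 10, 58, -2),
]


def table_entry(coeff_idx):
    """Return the two 8-word vectors assumed to sit at offsets +0 and +16."""
    c0, c1, c2, c3 = CHROMA_COEFF[coeff_idx]
    return [c0, c1] * 4, [c2, c3] * 4


def pmaddwd(a, b):
    """Multiply words element-wise and add each adjacent pair of products."""
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]


def filter_pairwise(rows, coeff_idx):
    """rows: four lists of 4 pixels (rows y-1 .. y+2). Returns 4 unrounded sums."""
    lo, hi = table_entry(coeff_idx)
    r0, r1, r2, r3 = rows
    # Interleave row pairs, as punpcklbw plus zero-extension to words would.
    pair01 = [p for x in range(4) for p in (r0[x], r1[x])]
    pair23 = [p for x in range(4) for p in (r2[x], r3[x])]
    return [a + b for a, b in zip(pmaddwd(pair01, lo), pmaddwd(pair23, hi))]


def filter_direct(rows, coeff_idx):
    """Plain 4-tap sum per column, for comparison."""
    coeff = CHROMA_COEFF[coeff_idx]
    return [sum(c * rows[k][x] for k, c in enumerate(coeff)) for x in range(4)]
```

The sketch ignores the 16-bit saturation of `packssdw`; for 8-bit input and these taps the partial and full sums stay within the signed 16-bit range, so the two forms agree.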
+;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W16_H2_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_16x%2, 4, 6, 11 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m6, [pw_32] +%elifidn %1,ps + mova m6, [pw_2000] + add r3d, r3d +%endif + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m1, [r5 + r4] + mova m0, [r5 + r4 + 16] +%else + mova m1, [tab_ChromaCoeffV + r4] + mova m0, [tab_ChromaCoeffV + r4 + 16] +%endif + +%assign x 1 +%rep %2/2 + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m10, [r0 + r1] + + punpckhbw m7, m5, m10 + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + paddw m2, m7 + + punpcklbw m7, m5, m10 + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + paddw m4, m7 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m2, m6 + psraw m2, 6 + + packuswb m4, m2 + movu [r2], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m2, m6 + movu [r2], m4 + movu [r2 + 16], m2 +%endif + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m1 + pmaddwd m8, m1 + packssdw m3, m8 + + movu m5, [r0 + 2 * r1] + + punpcklbw m2, m10, m5 + punpckhbw m10, m5 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m0 + pmaddwd 
m8, m0 + packssdw m2, m8 + + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + + paddw m4, m2 + paddw m3, m10 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m3, m6 + psraw m3, 6 + + packuswb m4, m3 + movu [r2 + r3], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m3, m6 + movu [r2 + r3], m4 + movu [r2 + r3 + 16], m3 +%endif + +%if x < %2/2 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W16_H2_sse2 pp, 4 + FILTER_V4_W16_H2_sse2 pp, 8 + FILTER_V4_W16_H2_sse2 pp, 12 + FILTER_V4_W16_H2_sse2 pp, 16 + FILTER_V4_W16_H2_sse2 pp, 32 + + FILTER_V4_W16_H2_sse2 pp, 24 + FILTER_V4_W16_H2_sse2 pp, 64 + + FILTER_V4_W16_H2_sse2 ps, 4 + FILTER_V4_W16_H2_sse2 ps, 8 + FILTER_V4_W16_H2_sse2 ps, 12 + FILTER_V4_W16_H2_sse2 ps, 16 + FILTER_V4_W16_H2_sse2 ps, 32 + + FILTER_V4_W16_H2_sse2 ps, 24 + FILTER_V4_W16_H2_sse2 ps, 64 +%endif + +;----------------------------------------------------------------------------- +;void interp_4tap_vert_%1_24%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W24_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_24x%2, 4, 6, 11 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m6, [pw_32] +%elifidn %1,ps + mova m6, [pw_2000] + add r3d, r3d +%endif + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m1, [r5 + r4] + mova m0, [r5 + r4 + 16] +%else + mova m1, [tab_ChromaCoeffV + r4] + mova m0, [tab_ChromaCoeffV + r4 + 16] +%endif + +%assign x 1 +%rep %2/2 + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + lea r5, [r0 + 2 * r1] + movu m5, [r5] 
+ movu m10, [r5 + r1] + punpcklbw m7, m5, m10 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + paddw m4, m7 + + punpckhbw m7, m5, m10 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + + paddw m2, m7 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m2, m6 + psraw m2, 6 + + packuswb m4, m2 + movu [r2], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m2, m6 + movu [r2], m4 + movu [r2 + 16], m2 +%endif + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m1 + pmaddwd m8, m1 + packssdw m3, m8 + + movu m2, [r5 + 2 * r1] + + punpcklbw m5, m10, m2 + punpckhbw m10, m2 + + movhlps m8, m5 + punpcklbw m5, m9 + punpcklbw m8, m9 + pmaddwd m5, m0 + pmaddwd m8, m0 + packssdw m5, m8 + + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + + paddw m4, m5 + paddw m3, m10 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m3, m6 + psraw m3, 6 + + packuswb m4, m3 + movu [r2 + r3], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m3, m6 + movu [r2 + r3], m4 + movu [r2 + r3 + 16], m3 +%endif + + movq m2, [r0 + 16] + movq m3, [r0 + r1 + 16] + movq m4, [r5 + 16] + movq m5, [r5 + r1 + 16] + + punpcklbw m2, m3 + punpcklbw m4, m5 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m0 + pmaddwd m8, m0 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + paddw m2, m4 + +%ifidn %1,pp + paddw m2, m6 + psraw m2, 6 +%elifidn %1,ps + psubw m2, m6 + movu [r2 + 32], m2 +%endif + + movq m3, [r0 + r1 + 16] + movq m4, [r5 + 16] + movq m5, [r5 + r1 + 16] + movq m7, [r5 + 2 * r1 + 16] + + punpcklbw m3, m4 + punpcklbw m5, m7 + + movhlps m8, m5 + punpcklbw m5, m9 + punpcklbw 
m8, m9 + pmaddwd m5, m0 + pmaddwd m8, m0 + packssdw m5, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m1 + pmaddwd m8, m1 + packssdw m3, m8 + + paddw m3, m5 + +%ifidn %1,pp + paddw m3, m6 + psraw m3, 6 + + packuswb m2, m3 + movh [r2 + 16], m2 + movhps [r2 + r3 + 16], m2 +%elifidn %1,ps + psubw m3, m6 + movu [r2 + r3 + 32], m3 +%endif + +%if x < %2/2 + mov r0, r5 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W24_sse2 pp, 32 + FILTER_V4_W24_sse2 pp, 64 + FILTER_V4_W24_sse2 ps, 32 + FILTER_V4_W24_sse2 ps, 64 +%endif + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W32_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_32x%2, 4, 6, 10 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m6, [pw_32] +%elifidn %1,ps + mova m6, [pw_2000] + add r3d, r3d +%endif + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m1, [r5 + r4] + mova m0, [r5 + r4 + 16] +%else + mova m1, [tab_ChromaCoeffV + r4] + mova m0, [tab_ChromaCoeffV + r4 + 16] +%endif + + mov r4d, %2 + +.loop: + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + lea r5, [r0 + 2 * r1] + movu m3, [r5] + movu m5, [r5 + r1] + + punpcklbw m7, m3, m5 + punpckhbw m3, m5 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m0 + pmaddwd m8, m0 + packssdw m3, m8 + + paddw m4, m7 + paddw m2, m3 + +%ifidn %1,pp + 
paddw m4, m6 + psraw m4, 6 + paddw m2, m6 + psraw m2, 6 + + packuswb m4, m2 + movu [r2], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m2, m6 + movu [r2], m4 + movu [r2 + 16], m2 +%endif + + movu m2, [r0 + 16] + movu m3, [r0 + r1 + 16] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + movu m3, [r5 + 16] + movu m5, [r5 + r1 + 16] + + punpcklbw m7, m3, m5 + punpckhbw m3, m5 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m0 + pmaddwd m8, m0 + packssdw m3, m8 + + paddw m4, m7 + paddw m2, m3 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m2, m6 + psraw m2, 6 + + packuswb m4, m2 + movu [r2 + 16], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m2, m6 + movu [r2 + 32], m4 + movu [r2 + 48], m2 +%endif + + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r4 + jnz .loop + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W32_sse2 pp, 8 + FILTER_V4_W32_sse2 pp, 16 + FILTER_V4_W32_sse2 pp, 24 + FILTER_V4_W32_sse2 pp, 32 + + FILTER_V4_W32_sse2 pp, 48 + FILTER_V4_W32_sse2 pp, 64 + + FILTER_V4_W32_sse2 ps, 8 + FILTER_V4_W32_sse2 ps, 16 + FILTER_V4_W32_sse2 ps, 24 + FILTER_V4_W32_sse2 ps, 32 + + FILTER_V4_W32_sse2 ps, 48 + FILTER_V4_W32_sse2 ps, 64 +%endif + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W16n_H2_sse2 3 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_%2x%3, 4, 7, 11 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m7, [pw_32] +%elifidn %1,ps + mova m7, [pw_2000] + add 
r3d, r3d +%endif + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m1, [r5 + r4] + mova m0, [r5 + r4 + 16] +%else + mova m1, [tab_ChromaCoeffV + r4] + mova m0, [tab_ChromaCoeffV + r4 + 16] +%endif + + mov r4d, %3/2 + +.loop: + + mov r6d, %2/16 + +.loopW: + + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + lea r5, [r0 + 2 * r1] + movu m5, [r5] + movu m6, [r5 + r1] + + punpckhbw m10, m5, m6 + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + paddw m2, m10 + + punpcklbw m10, m5, m6 + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + paddw m4, m10 + +%ifidn %1,pp + paddw m4, m7 + psraw m4, 6 + paddw m2, m7 + psraw m2, 6 + + packuswb m4, m2 + movu [r2], m4 +%elifidn %1,ps + psubw m4, m7 + psubw m2, m7 + movu [r2], m4 + movu [r2 + 16], m2 +%endif + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m1 + pmaddwd m8, m1 + packssdw m3, m8 + + movu m5, [r5 + 2 * r1] + + punpcklbw m2, m6, m5 + punpckhbw m6, m5 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m0 + pmaddwd m8, m0 + packssdw m2, m8 + + movhlps m8, m6 + punpcklbw m6, m9 + punpcklbw m8, m9 + pmaddwd m6, m0 + pmaddwd m8, m0 + packssdw m6, m8 + + paddw m4, m2 + paddw m3, m6 + +%ifidn %1,pp + paddw m4, m7 + psraw m4, 6 + paddw m3, m7 + psraw m3, 6 + + packuswb m4, m3 + movu [r2 + r3], m4 + add r2, 16 +%elifidn %1,ps + psubw m4, m7 + psubw m3, m7 + movu [r2 + r3], m4 + movu [r2 + r3 + 16], m3 + add r2, 32 +%endif + + add r0, 16 + dec r6d + jnz .loopW + + lea 
r0, [r0 + r1 * 2 - %2] + +%ifidn %1,pp + lea r2, [r2 + r3 * 2 - %2] +%elifidn %1,ps + lea r2, [r2 + r3 * 2 - (%2 * 2)] +%endif + + dec r4d + jnz .loop + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W16n_H2_sse2 pp, 64, 64 + FILTER_V4_W16n_H2_sse2 pp, 64, 32 + FILTER_V4_W16n_H2_sse2 pp, 64, 48 + FILTER_V4_W16n_H2_sse2 pp, 48, 64 + FILTER_V4_W16n_H2_sse2 pp, 64, 16 + FILTER_V4_W16n_H2_sse2 ps, 64, 64 + FILTER_V4_W16n_H2_sse2 ps, 64, 32 + FILTER_V4_W16n_H2_sse2 ps, 64, 48 + FILTER_V4_W16n_H2_sse2 ps, 48, 64 + FILTER_V4_W16n_H2_sse2 ps, 64, 16 +%endif + %macro FILTER_P2S_2_4_sse2 1 movd m2, [r0 + %1] movd m3, [r0 + r1 + %1] @@ -778,6 +3461,577 @@ FILTER_PIX_TO_SHORT_sse2 64, 48 FILTER_PIX_TO_SHORT_sse2 64, 64 +%macro FILTER_H4_w2_2 3 + movh %2, [srcq - 1] + pshufb %2, %2, Tm0 + movh %1, [srcq + srcstrideq - 1] + pshufb %1, %1, Tm0 + punpcklqdq %2, %1 + pmaddubsw %2, coef2 + phaddw %2, %2 + pmulhrsw %2, %3 + packuswb %2, %2 + movd r4, %2 + mov [dstq], r4w + shr r4, 16 + mov [dstq + dststrideq], r4w +%endmacro + + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_2x4, 4, 6, 5, src, srcstride, dst, dststride +%define coef2 m4 +%define Tm0 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + +%rep 2 + FILTER_H4_w2_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endrep + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
+;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_2x8, 4, 6, 5, src, srcstride, dst, dststride +%define coef2 m4 +%define Tm0 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + +%rep 4 + FILTER_H4_w2_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endrep + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_2x16, 4, 6, 5, src, srcstride, dst, dststride +%define coef2 m4 +%define Tm0 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + + mov r5d, 16/2 + +.loop: + FILTER_H4_w2_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] + dec r5d + jnz .loop + + RET + +%macro FILTER_H4_w4_2 3 + movh %2, [srcq - 1] + pshufb %2, %2, Tm0 + pmaddubsw %2, coef2 + movh %1, [srcq + srcstrideq - 1] + pshufb %1, %1, Tm0 + pmaddubsw %1, coef2 + phaddw %2, %1 + pmulhrsw %2, %3 + packuswb %2, %2 + movd [dstq], %2 + palignr %2, %2, 4 + movd [dstq + dststrideq], %2 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal 
interp_4tap_horiz_pp_4x2, 4, 6, 5, src, srcstride, dst, dststride +%define coef2 m4 +%define Tm0 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + + FILTER_H4_w4_2 t0, t1, t2 + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_4x4, 4, 6, 5, src, srcstride, dst, dststride +%define coef2 m4 +%define Tm0 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + +%rep 2 + FILTER_H4_w4_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endrep + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_4x8, 4, 6, 5, src, srcstride, dst, dststride +%define coef2 m4 +%define Tm0 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + +%rep 4 + FILTER_H4_w4_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endrep + + RET + 
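Note on the rounding step used by the `pp` horizontal kernels above: each one loads `pw_512` into a register and applies `pmulhrsw` to the filtered words before `packuswb`. Assuming `pw_512` is a vector of 16-bit words all equal to 512 (its definition is not shown in this part of the diff), that multiply appears to be a one-instruction way of computing `(x + 32) >> 6`. The sketch below models a single lane in C and checks the identity over the whole 16-bit range. The helper names are hypothetical and are not part of x265.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of PMULHRSW, following the instruction's
 * documented definition: ((a * b >> 14) + 1) >> 1.
 * Assumes >> on a negative int is an arithmetic shift (true for GCC/Clang). */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t t = ((int32_t)a * (int32_t)b) >> 14;
    return (int16_t)((t + 1) >> 1);
}

/* The rounding a 6-bit fixed-point filter needs: add half of 64, shift by 6. */
static int32_t round_shift6(int32_t x)
{
    return (x + 32) >> 6;
}

/* Exhaustively compare the two forms for every possible 16-bit input. */
static void check_pw512_equivalence(void)
{
    for (int32_t x = INT16_MIN; x <= INT16_MAX; x++)
        assert(pmulhrsw_lane((int16_t)x, 512) == round_shift6(x));
}
```

Calling `check_pw512_equivalence()` from a small test program is enough to exercise the check; the subsequent `packuswb` in the assembly then saturates the result to the 0..255 pixel range.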
+;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_4x16, 4, 6, 5, src, srcstride, dst, dststride +%define coef2 m4 +%define Tm0 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + +%rep 8 + FILTER_H4_w4_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endrep + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_4x32, 4, 6, 5, src, srcstride, dst, dststride +%define coef2 m4 +%define Tm0 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + + mov r5d, 32/2 + +.loop: + FILTER_H4_w4_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] + dec r5d + jnz .loop + + RET + +ALIGN 32 +const interp_4tap_8x8_horiz_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 + + +%macro FILTER_H4_w6 3 + movu %1, [srcq - 1] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + pmulhrsw %2, %3 + packuswb %2, %2 + movd [dstq], %2 + pextrw [dstq + 4], %2, 2 +%endmacro + +%macro FILTER_H4_w8 3 + movu %1, [srcq - 1] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 
+ pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + pmulhrsw %2, %3 + packuswb %2, %2 + movh [dstq], %2 +%endmacro + +%macro FILTER_H4_w12 3 + movu %1, [srcq - 1] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + pmulhrsw %2, %3 + movu %1, [srcq - 1 + 8] + pshufb %1, %1, Tm0 + pmaddubsw %1, coef2 + phaddw %1, %1 + pmulhrsw %1, %3 + packuswb %2, %1 + movh [dstq], %2 + pextrd [dstq + 8], %2, 2 +%endmacro + +%macro FILTER_H4_w16 4 + movu %1, [srcq - 1] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq - 1 + 8] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + pmulhrsw %2, %3 + pmulhrsw %4, %3 + packuswb %2, %4 + movu [dstq], %2 +%endmacro + +%macro FILTER_H4_w24 4 + movu %1, [srcq - 1] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq - 1 + 8] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + pmulhrsw %2, %3 + pmulhrsw %4, %3 + packuswb %2, %4 + movu [dstq], %2 + movu %1, [srcq - 1 + 16] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + pmulhrsw %2, %3 + packuswb %2, %2 + movh [dstq + 16], %2 +%endmacro + +%macro FILTER_H4_w32 4 + movu %1, [srcq - 1] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq - 1 + 8] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + pmulhrsw %2, %3 + pmulhrsw %4, %3 + packuswb %2, %4 + movu [dstq], %2 + movu %1, [srcq - 1 + 16] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq - 1 + 24] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + pmulhrsw 
%2, %3 + pmulhrsw %4, %3 + packuswb %2, %4 + movu [dstq + 16], %2 +%endmacro + +%macro FILTER_H4_w16o 5 + movu %1, [srcq + %5 - 1] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq + %5 - 1 + 8] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + pmulhrsw %2, %3 + pmulhrsw %4, %3 + packuswb %2, %4 + movu [dstq + %5], %2 +%endmacro + +%macro FILTER_H4_w48 4 + FILTER_H4_w16o %1, %2, %3, %4, 0 + FILTER_H4_w16o %1, %2, %3, %4, 16 + FILTER_H4_w16o %1, %2, %3, %4, 32 +%endmacro + +%macro FILTER_H4_w64 4 + FILTER_H4_w16o %1, %2, %3, %4, 0 + FILTER_H4_w16o %1, %2, %3, %4, 16 + FILTER_H4_w16o %1, %2, %3, %4, 32 + FILTER_H4_w16o %1, %2, %3, %4, 48 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro IPFILTER_CHROMA 2 +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 6, src, srcstride, dst, dststride +%define coef2 m5 +%define Tm0 m4 +%define Tm1 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + mov r5d, %2 + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + mova Tm1, [tab_Tm + 16] + +.loop: + FILTER_H4_w%1 t0, t1, t2 + add srcq, srcstrideq + add dstq, dststrideq + + dec r5d + jnz .loop + + RET +%endmacro + + + IPFILTER_CHROMA 6, 8 + IPFILTER_CHROMA 8, 2 + IPFILTER_CHROMA 8, 4 + IPFILTER_CHROMA 8, 6 + IPFILTER_CHROMA 8, 8 + IPFILTER_CHROMA 8, 16 + IPFILTER_CHROMA 8, 32 + IPFILTER_CHROMA 12, 16 + + IPFILTER_CHROMA 6, 16 + IPFILTER_CHROMA 8, 12 + IPFILTER_CHROMA 8, 64 + IPFILTER_CHROMA 12, 32 + 
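For readers checking the `FILTER_H4_w*` macros and the `IPFILTER_CHROMA` instantiations above, a scalar reference can help. The following is a minimal sketch, not the x265 C fallback: it assumes 8-bit pixels, assumes `tab_ChromaCoeff` holds the standard HEVC 4-tap chroma coefficients (the table itself is not visible in this part of the diff), and adds hypothetical `width` and `height` parameters in place of the per-size entry points. The taps start one pixel to the left of the output position, matching the `[srcq - 1]` loads.

```c
#include <stdint.h>

/* HEVC 4-tap chroma interpolation coefficients, indexed by coeffIdx (0..7).
 * Assumption: tab_ChromaCoeff in the assembly encodes the same values. */
static const int8_t chroma_coeff[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 },
};

/* Saturate to the 8-bit pixel range, as packuswb does. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar counterpart of interp_4tap_horiz_pp_WxH.
 * The caller must provide 1 readable pixel left and 2 right of each row.
 * Assumes >> on a negative int is an arithmetic shift (true for GCC/Clang). */
static void interp_4tap_horiz_pp_ref(const uint8_t *src, intptr_t srcStride,
                                     uint8_t *dst, intptr_t dstStride,
                                     int coeffIdx, int width, int height)
{
    const int8_t *c = chroma_coeff[coeffIdx];
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = c[0] * src[x - 1] + c[1] * src[x]
                    + c[2] * src[x + 1] + c[3] * src[x + 2];
            dst[x] = clip_u8((sum + 32) >> 6);  /* round, then drop 6 fraction bits */
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

Because every coefficient row sums to 64, a flat input row comes back unchanged for any `coeffIdx`, and `coeffIdx == 0` is a plain copy. Both make convenient sanity checks when comparing against the SIMD output.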
+;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_W 2 +INIT_XMM sse4 +cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 7, src, srcstride, dst, dststride +%define coef2 m6 +%define Tm0 m5 +%define Tm1 m4 +%define t3 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + mov r5d, %2 + + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + mova Tm1, [tab_Tm + 16] + +.loop: + FILTER_H4_w%1 t0, t1, t2, t3 + add srcq, srcstrideq + add dstq, dststrideq + + dec r5d + jnz .loop + + RET +%endmacro + + IPFILTER_CHROMA_W 16, 4 + IPFILTER_CHROMA_W 16, 8 + IPFILTER_CHROMA_W 16, 12 + IPFILTER_CHROMA_W 16, 16 + IPFILTER_CHROMA_W 16, 32 + IPFILTER_CHROMA_W 32, 8 + IPFILTER_CHROMA_W 32, 16 + IPFILTER_CHROMA_W 32, 24 + IPFILTER_CHROMA_W 24, 32 + IPFILTER_CHROMA_W 32, 32 + + IPFILTER_CHROMA_W 16, 24 + IPFILTER_CHROMA_W 16, 64 + IPFILTER_CHROMA_W 32, 48 + IPFILTER_CHROMA_W 24, 64 + IPFILTER_CHROMA_W 32, 64 + + IPFILTER_CHROMA_W 64, 64 + IPFILTER_CHROMA_W 64, 32 + IPFILTER_CHROMA_W 64, 48 + IPFILTER_CHROMA_W 48, 64 + IPFILTER_CHROMA_W 64, 16 + + %macro FILTER_H8_W8 7-8 ; t0, t1, t2, t3, coef, c512, src, dst movu %1, %7 pshufb %2, %1, [tab_Lm + 0] @@ -798,6 +4052,1728 @@ %endif %endmacro +%macro FILTER_H8_W4 2 + movu %1, [r0 - 3 + r5] + pshufb %2, %1, [tab_Lm] + pmaddubsw %2, m3 + pshufb m7, %1, [tab_Lm + 16] + pmaddubsw m7, m3 + phaddw %2, m7 + phaddw %2, %2 +%endmacro + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int 
coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_LUMA 3 +INIT_XMM sse4 +cglobal interp_8tap_horiz_%3_%1x%2, 4,7,8 + + mov r4d, r4m + +%ifdef PIC + lea r6, [tab_LumaCoeff] + movh m3, [r6 + r4 * 8] +%else + movh m3, [tab_LumaCoeff + r4 * 8] +%endif + punpcklqdq m3, m3 + +%ifidn %3, pp + mova m2, [pw_512] +%else + mova m2, [pw_2000] +%endif + + mov r4d, %2 +%ifidn %3, ps + add r3, r3 + cmp r5m, byte 0 + je .loopH + lea r6, [r1 + 2 * r1] + sub r0, r6 + add r4d, 7 +%endif + +.loopH: + xor r5, r5 +%rep %1 / 8 + %ifidn %3, pp + FILTER_H8_W8 m0, m1, m4, m5, m3, m2, [r0 - 3 + r5], [r2 + r5] + %else + FILTER_H8_W8 m0, m1, m4, m5, m3, UNUSED, [r0 - 3 + r5] + psubw m1, m2 + movu [r2 + 2 * r5], m1 + %endif + add r5, 8 +%endrep + +%rep (%1 % 8) / 4 + FILTER_H8_W4 m0, m1 + %ifidn %3, pp + pmulhrsw m1, m2 + packuswb m1, m1 + movd [r2 + r5], m1 + %else + psubw m1, m2 + movh [r2 + 2 * r5], m1 + %endif +%endrep + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET +%endmacro + + +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_4x4, 4,6,6 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + + mova m1, [tab_Lm] + vpbroadcastd m2, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + sub r0, 3 + ; Row 0-1 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + ; Row 2-3 + lea r0, [r0 + r1 * 2] + vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, 
m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + phaddd m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + + packssdw m3, m4 ; WORD [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A] + pmulhrsw m3, [pw_512] + vextracti128 xm4, m3, 1 + packuswb xm3, xm4 ; BYTE [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A] + pshufb xm3, [interp4_shuf] ; [row3 row1 row2 row0] + + lea r0, [r3 * 3] + movd [r2], xm3 + pextrd [r2+r3], xm3, 2 + pextrd [r2+r3*2], xm3, 1 + pextrd [r2+r0], xm3, 3 + RET + +%macro FILTER_HORIZ_LUMA_AVX2_4xN 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_pp_4x%1, 4, 6, 9 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + + mova m1, [tab_Lm] + mova m2, [pw_1] + mova m7, [interp8_hps_shuf] + mova m8, [pw_512] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + lea r4, [r1 * 3] + lea r5, [r3 * 3] + sub r0, 3 +%rep %1 / 8 + ; Row 0-1 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + ; Row 2-3 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + phaddd m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + + packssdw m3, m4 ; WORD [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A] + lea r0, [r0 + r1 * 4] + ; Row 4-5 + vbroadcasti128 m5, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd 
m4, m2 + phaddd m5, m4 ; DWORD [R5D R5C R4D R4C R5B R5A R4B R4A] + + ; Row 6-7 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m6, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m6, m1 + pmaddubsw m6, m0 + pmaddwd m6, m2 + phaddd m4, m6 ; DWORD [R7D R7C R6D R6C R7B R7A R6B R6A] + + packssdw m5, m4 ; WORD [R7D R7C R6D R6C R5D R5C R4D R4C R7B R7A R6B R6A R5B R5A R4B R4A] + vpermd m3, m7, m3 + vpermd m5, m7, m5 + pmulhrsw m3, m8 + pmulhrsw m5, m8 + packuswb m3, m5 + vextracti128 xm5, m3, 1 + + movd [r2], xm3 + pextrd [r2 + r3], xm3, 1 + movd [r2 + r3 * 2], xm5 + pextrd [r2 + r5], xm5, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm3, 2 + pextrd [r2 + r3], xm3, 3 + pextrd [r2 + r3 * 2], xm5, 2 + pextrd [r2 + r5], xm5, 3 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + RET +%endif +%endmacro + + FILTER_HORIZ_LUMA_AVX2_4xN 8 + FILTER_HORIZ_LUMA_AVX2_4xN 16 + +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_8x4, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + + mova m1, [tab_Lm] + mova m2, [tab_Lm + 32] + + ; register map + ; m0 - interpolate coeff + ; m1, m2 - shuffle order table + + sub r0, 3 + lea r5, [r1 * 3] + lea r4, [r3 * 3] + + ; Row 0 + vbroadcasti128 m3, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m2 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddubsw m4, m0 + phaddw m3, m4 + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m2 + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + phaddw m4, m5 + + phaddw m3, m4 ; WORD [R1H R1G R1D R1C R0H R0G R0D R0C R1F R1E R1B R1A R0F R0E R0B R0A] + pmulhrsw m3, [pw_512] + + ; Row 2 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m2 + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + phaddw m4, m5 + ; Row 3 + vbroadcasti128 
m5, [r0 + r5] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m6, m5, m2 + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddubsw m6, m0 + phaddw m5, m6 + + phaddw m4, m5 ; WORD [R3H R3G R3D R3C R2H R2G R2D R2C R3F R3E R3B R3A R2F R2E R2B R2A] + pmulhrsw m4, [pw_512] + + packuswb m3, m4 + vextracti128 xm4, m3, 1 + punpcklwd xm5, xm3, xm4 + + movq [r2], xm5 + movhps [r2 + r3], xm5 + + punpckhwd xm5, xm3, xm4 + movq [r2 + r3 * 2], xm5 + movhps [r2 + r4], xm5 + RET + +%macro IPFILTER_LUMA_AVX2_8xN 2 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_%1x%2, 4, 7, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + + mova m1, [tab_Lm] + mova m2, [tab_Lm + 32] + + ; register map + ; m0 - interpolate coeff + ; m1, m2 - shuffle order table + + sub r0, 3 + lea r5, [r1 * 3] + lea r6, [r3 * 3] + mov r4d, %2 / 4 +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m2 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddubsw m4, m0 + phaddw m3, m4 + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m2 + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + phaddw m4, m5 + + phaddw m3, m4 ; WORD [R1H R1G R1D R1C R0H R0G R0D R0C R1F R1E R1B R1A R0F R0E R0B R0A] + pmulhrsw m3, [pw_512] + + ; Row 2 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m2 + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + phaddw m4, m5 + ; Row 3 + vbroadcasti128 m5, [r0 + r5] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m6, m5, m2 + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddubsw m6, m0 + phaddw m5, m6 + + phaddw m4, m5 ; WORD [R3H R3G R3D R3C R2H R2G R2D R2C R3F R3E R3B R3A R2F R2E R2B R2A] + pmulhrsw m4, [pw_512] + + packuswb m3, m4 + vextracti128 xm4, m3, 1 + punpcklwd xm5, xm3, xm4 + + movq [r2], xm5 + movhps [r2 + r3], xm5 + + punpckhwd xm5, xm3, xm4 + movq [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm5 + + lea 
r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + dec r4d + jnz .loop + RET +%endmacro + + IPFILTER_LUMA_AVX2_8xN 8, 8 + IPFILTER_LUMA_AVX2_8xN 8, 16 + IPFILTER_LUMA_AVX2_8xN 8, 32 + +%macro IPFILTER_LUMA_AVX2 2 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 + sub r0, 3 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + movu m3, [tab_Tm + 16] + vpbroadcastd m7, [pw_1] + + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m2 shuffle order table + ; m7 - pw_1 + + mov r4d, %2/2 +.loop: + ; Row 0 + vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m3 + pshufb m4, [tab_Tm] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + paddw m4, m5 + pmaddwd m4, m7 + vbroadcasti128 m5, [r0 + 8] ; second 8 elements in Row0 + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] + pmulhrsw m4, [pw_512] + vbroadcasti128 m2, [r0 + r1] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m2, m3 + pshufb m2, [tab_Tm] + pmaddubsw m2, m0 + pmaddubsw m5, m1 + paddw m2, m5 + pmaddwd m2, m7 + vbroadcasti128 m5, [r0 + r1 + 8] ; second 8 elements in Row0 + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m2, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] + pmulhrsw m2, [pw_512] + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm5, m4, 1 + pshufd xm4, xm4, 11011000b + pshufd xm5, xm5, 11011000b + movu [r2], xm4 + movu [r2+r3], xm5 + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + dec r4d + jnz .loop + RET +%endmacro + +%macro IPFILTER_LUMA_32x_avx2 2 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 + sub r0, 3 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd 
m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + movu m3, [tab_Tm + 16] + vpbroadcastd m7, [pw_1] + + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m2 shuffle order table + ; m7 - pw_1 + + mov r4d, %2 +.loop: + ; Row 0 + vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m3 + pshufb m4, [tab_Tm] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + paddw m4, m5 + pmaddwd m4, m7 + vbroadcasti128 m5, [r0 + 8] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] + pmulhrsw m4, [pw_512] + vbroadcasti128 m2, [r0 + 16] + pshufb m5, m2, m3 + pshufb m2, [tab_Tm] + pmaddubsw m2, m0 + pmaddubsw m5, m1 + paddw m2, m5 + pmaddwd m2, m7 + vbroadcasti128 m5, [r0 + 24] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m2, m5 + pmulhrsw m2, [pw_512] + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm5, m4, 1 + pshufd xm4, xm4, 11011000b + pshufd xm5, xm5, 11011000b + movu [r2], xm4 + movu [r2 + 16], xm5 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r4d + jnz .loop + RET +%endmacro + +%macro IPFILTER_LUMA_64x_avx2 2 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 + sub r0, 3 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + movu m3, [tab_Tm + 16] + vpbroadcastd m7, [pw_1] + + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m2 shuffle order table + ; m7 - pw_1 + + mov r4d, %2 +.loop: + ; Row 0 + vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m3 + pshufb m4, [tab_Tm] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + paddw m4, m5 
+ pmaddwd m4, m7 + vbroadcasti128 m5, [r0 + 8] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] + pmulhrsw m4, [pw_512] + vbroadcasti128 m2, [r0 + 16] + pshufb m5, m2, m3 + pshufb m2, [tab_Tm] + pmaddubsw m2, m0 + pmaddubsw m5, m1 + paddw m2, m5 + pmaddwd m2, m7 + vbroadcasti128 m5, [r0 + 24] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m2, m5 + pmulhrsw m2, [pw_512] + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm5, m4, 1 + pshufd xm4, xm4, 11011000b + pshufd xm5, xm5, 11011000b + movu [r2], xm4 + movu [r2 + 16], xm5 + + vbroadcasti128 m4, [r0 + 32] + pshufb m5, m4, m3 + pshufb m4, [tab_Tm] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + paddw m4, m5 + pmaddwd m4, m7 + vbroadcasti128 m5, [r0 + 40] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m4, m5 + pmulhrsw m4, [pw_512] + vbroadcasti128 m2, [r0 + 48] + pshufb m5, m2, m3 + pshufb m2, [tab_Tm] + pmaddubsw m2, m0 + pmaddubsw m5, m1 + paddw m2, m5 + pmaddwd m2, m7 + vbroadcasti128 m5, [r0 + 56] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m2, m5 + pmulhrsw m2, [pw_512] + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm5, m4, 1 + pshufd xm4, xm4, 11011000b + pshufd xm5, xm5, 11011000b + movu [r2 +32], xm4 + movu [r2 + 48], xm5 + + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r4d + jnz .loop + RET +%endmacro + +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_48x64, 4,6,8 + sub r0, 3 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + movu m3, [tab_Tm + 16] + 
vpbroadcastd m7, [pw_1] + + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m2 shuffle order table + ; m7 - pw_1 + + mov r4d, 64 +.loop: + ; Row 0 + vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m3 + pshufb m4, [tab_Tm] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + paddw m4, m5 + pmaddwd m4, m7 + vbroadcasti128 m5, [r0 + 8] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] + pmulhrsw m4, [pw_512] + + vbroadcasti128 m2, [r0 + 16] + pshufb m5, m2, m3 + pshufb m2, [tab_Tm] + pmaddubsw m2, m0 + pmaddubsw m5, m1 + paddw m2, m5 + pmaddwd m2, m7 + vbroadcasti128 m5, [r0 + 24] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m2, m5 + pmulhrsw m2, [pw_512] + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm5, m4, 1 + pshufd xm4, xm4, 11011000b + pshufd xm5, xm5, 11011000b + movu [r2], xm4 + movu [r2 + 16], xm5 + + vbroadcasti128 m4, [r0 + 32] + pshufb m5, m4, m3 + pshufb m4, [tab_Tm] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + paddw m4, m5 + pmaddwd m4, m7 + vbroadcasti128 m5, [r0 + 40] + pshufb m6, m5, m3 + pshufb m5, [tab_Tm] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m4, m5 + pmulhrsw m4, [pw_512] + packuswb m4, m4 + vpermq m4, m4, 11011000b + pshufd xm4, xm4, 11011000b + movu [r2 + 32], xm4 + + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_4x4, 4,6,6 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vpbroadcastd m2, [pw_1] + vbroadcasti128 m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + + ; Row 0-1 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 
4 3 2 1 0] + vinserti128 m3, m3, [r0 + r1], 1 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 2-3 + lea r0, [r0 + r1 * 2] + vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + vinserti128 m4, m4, [r0 + r1], 1 + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + pmulhrsw m3, [pw_512] + vextracti128 xm4, m3, 1 + packuswb xm3, xm4 + + lea r0, [r3 * 3] + movd [r2], xm3 + pextrd [r2+r3], xm3, 2 + pextrd [r2+r3*2], xm3, 1 + pextrd [r2+r0], xm3, 3 + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_2x4, 4, 6, 3 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + dec r0 + lea r4, [r1 * 3] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + pshufb m1, [interp4_hpp_shuf] + pmaddubsw m1, m0 + pmaddwd m1, [pw_1] + vextracti128 xm2, m1, 1 + packssdw xm1, xm2 + pmulhrsw xm1, [pw_512] + packuswb xm1, xm1 + + lea r4, [r3 * 3] + pextrw [r2], xm1, 0 + pextrw [r2 + r3], xm1, 1 + pextrw [r2 + r3 * 2], xm1, 2 + pextrw [r2 + r4], xm1, 3 + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m4, [interp4_hpp_shuf] + mova m5, [pw_1] + dec r0 + lea r4, [r1 * 3] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + pmulhrsw m1, [pw_512] + vextracti128 xm2, m1, 1 + packuswb xm1, xm2 + + lea r4, [r3 * 3] + pextrw [r2], xm1, 0 + pextrw [r2 + r3], xm1, 1 + pextrw [r2 + r3 * 2], xm1, 
4 + pextrw [r2 + r4], xm1, 5 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm1, 2 + pextrw [r2 + r3], xm1, 3 + pextrw [r2 + r3 * 2], xm1, 6 + pextrw [r2 + r4], xm1, 7 + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_32x32, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 32 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + movu [r2], m3 + lea r2, [r2 + r3] + lea r0, [r0 + r1] + dec r4d + jnz .loop + RET + + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_16x16, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m6, [pw_512] + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 8 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + 
pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + vextracti128 xm4, m3, 1 + movu [r2], xm3 + movu [r2 + r3], xm4 + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + dec r4d + jnz .loop + RET +;-------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- + IPFILTER_LUMA 4, 4, pp + IPFILTER_LUMA 4, 8, pp + IPFILTER_LUMA 12, 16, pp + IPFILTER_LUMA 4, 16, pp + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_8x8, 4,6,6 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + movu m1, [tab_Tm] + vpbroadcastd m2, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + sub r0, 1 + mov r4d, 2 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, [pw_512] + lea r0, [r0 + r1 * 2] + + ; Row 2 + vbroadcasti128 m4, [r0 ] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + ; Row 3 + vbroadcasti128 m5, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, [pw_512] + + packuswb m3, m4 + mova m5, [interp_4tap_8x8_horiz_shuf] + vpermd m3, m5, m3 + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movhps [r2 + r3], xm3 + lea r2, [r2 + r3 * 2] + movq 
[r2], xm4 + movhps [r2 + r3], xm4 + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1*2] + dec r4d + jnz .loop + RET + + IPFILTER_LUMA_AVX2 16, 4 + IPFILTER_LUMA_AVX2 16, 8 + IPFILTER_LUMA_AVX2 16, 12 + IPFILTER_LUMA_AVX2 16, 16 + IPFILTER_LUMA_AVX2 16, 32 + IPFILTER_LUMA_AVX2 16, 64 + + IPFILTER_LUMA_32x_avx2 32 , 8 + IPFILTER_LUMA_32x_avx2 32 , 16 + IPFILTER_LUMA_32x_avx2 32 , 24 + IPFILTER_LUMA_32x_avx2 32 , 32 + IPFILTER_LUMA_32x_avx2 32 , 64 + + IPFILTER_LUMA_64x_avx2 64 , 64 + IPFILTER_LUMA_64x_avx2 64 , 48 + IPFILTER_LUMA_64x_avx2 64 , 32 + IPFILTER_LUMA_64x_avx2 64 , 16 + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_8x2, 4, 6, 5 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [tab_Tm] + mova m2, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, [pw_512] + vextracti128 xm4, m3, 1 + packuswb xm3, xm4 + pshufd xm3, xm3, 11011000b + movq [r2], xm3 + movhps [r2 + r3], xm3 + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_8x6, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [tab_Tm] + mova m2, [pw_1] + mova m6, [pw_512] + lea r4, [r1 * 3] + lea r5, [r3 * 3] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + 
pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 2 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + ; Row 3 + vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + mova m5, [interp8_hps_shuf] + vpermd m3, m5, m3 + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movhps [r2 + r3], xm3 + movq [r2 + r3 * 2], xm4 + movhps [r2 + r5], xm4 + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] + ; Row 4 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 5 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + vextracti128 xm4, m3, 1 + packuswb xm3, xm4 + pshufd xm3, xm3, 11011000b + movq [r2], xm3 + movhps [r2 + r3], xm3 + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_6x8, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [tab_Tm] + mova m2, [pw_1] + mova m6, [pw_512] + lea r4, [r1 * 3] + lea r5, [r3 * 3] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 +%rep 2 + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 2 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + ; Row 3 + vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, 
m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vextracti128 xm4, m3, 1 + movd [r2], xm3 + pextrw [r2 + 4], xm4, 0 + pextrd [r2 + r3], xm3, 1 + pextrw [r2 + r3 + 4], xm4, 2 + pextrd [r2 + r3 * 2], xm3, 2 + pextrw [r2 + r3 * 2 + 4], xm4, 4 + pextrd [r2 + r5], xm3, 3 + pextrw [r2 + r5 + 4], xm4, 6 + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] +%endrep + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; +%macro IPFILTER_CHROMA_HPS_64xN 1 +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_64x%1, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, %1 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2 + 32], m3 + + vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 
6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 40] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2 + 64], m3 + + vbroadcasti128 m3, [r0 + 48] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 56] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2 + 96], m3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_HPS_64xN 64 + IPFILTER_CHROMA_HPS_64xN 32 + IPFILTER_CHROMA_HPS_64xN 48 + IPFILTER_CHROMA_HPS_64xN 16 + +;----------------------------------------------------------------------------------------------------------------------------- +;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- + +%macro IPFILTER_LUMA_PS_4xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_4x%1, 6,7,6 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m1, [tab_Lm] + add r3d, r3d + vbroadcasti128 m2, [pw_2000] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - pw_2000 + + sub r0, 3 + test r5d, r5d + mov r5d, %1 ; loop count variable - height + jz .preloop + lea r6, [r1 * 3] ; r8 = (N / 2 - 1) * srcStride + sub r0, r6 ; r0(src) - 3 * srcStride + add r5d, 7 ; need extra 7 rows, just set a specially flag here, blkheight += N - 1 (7 - 3 = 4 ; since the last three rows not in loop) + +.preloop: + lea r6, [r3 * 3] +.loop + ; Row 0-1 + vbroadcasti128 m3, [r0] ; [x x x 
x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 ; shuffled based on the col order tab_Lm + pmaddubsw m3, m0 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + phaddw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + ; Row 2-3 + lea r0, [r0 + r1 * 2] ;3rd row(i.e 2nd row) + vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + vbroadcasti128 m5, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + phaddw m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + phaddw m3, m4 ; all rows and col completed. + + mova m5, [interp8_hps_shuf] + vpermd m3, m5, m3 + psubw m3, m2 + + vextracti128 xm4, m3, 1 + movq [r2], xm3 ;row 0 + movhps [r2 + r3], xm3 ;row 1 + movq [r2 + r3 * 2], xm4 ;row 2 + movhps [r2 + r6], xm4 ;row 3 + + lea r0, [r0 + r1 * 2] ; first loop src ->5th row(i.e 4) + lea r2, [r2 + r3 * 4] ; first loop dst ->5th row(i.e 4) + sub r5d, 4 + jz .end + cmp r5d, 4 + jge .loop + + ; Row 8-9 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + phaddw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + ; Row 10 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + phaddw m4, m4 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + phaddw m3, m4 + + vpermd m3, m5, m3 ; m5 don't broken in above + psubw m3, m2 + + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movhps [r2 + r3], xm3 + movq [r2 + r3 * 2], xm4 +.end + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_4xN_AVX2 4 + IPFILTER_LUMA_PS_4xN_AVX2 8 + IPFILTER_LUMA_PS_4xN_AVX2 16 + +%macro IPFILTER_LUMA_PS_8xN_AVX2 1 +; TODO: verify and enable on X86 mode +%if ARCH_X86_64 == 1 +; void filter_hps(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +INIT_YMM avx2 +cglobal 
interp_8tap_horiz_ps_8x%1, 4,7,6 + mov r5d, r5m + mov r4d, r4m + shl r4d, 7 +%ifdef PIC + lea r6, [pb_LumaCoeffVer] + add r6, r4 +%else + lea r6, [pb_LumaCoeffVer + r4] +%endif + add r3d, r3d + vpbroadcastd m0, [pw_2000] + sub r0, 3 + lea r4, [pb_8tap_hps_0] + vbroadcasti128 m5, [r4 + 0 * mmsize] + + ; check row count extend for interpolateHV + test r5d, r5d; + mov r5d, %1 + jz .enter_loop + lea r4, [r1 * 3] ; r8 = (N / 2 - 1) * srcStride + sub r0, r4 ; r0(src)-r8 + add r5d, 8-1-2 ; blkheight += N - 1 (7 - 3 = 4 ; since the last three rows not in loop) + +.enter_loop: + lea r4, [pb_8tap_hps_0] + + ; ***** register map ***** + ; m0 - pw_2000 + ; r4 - base pointer of shuffle order table + ; r5 - count of loop + ; r6 - point to LumaCoeff +.loop: + + ; Row 0-1 + movu xm1, [r0] + movu xm2, [r0 + r1] + vinserti128 m1, m1, xm2, 1 + pshufb m2, m1, m5 ; [0 1 1 2 2 3 3 4 ...] + pshufb m3, m1, [r4 + 1 * mmsize] ; [2 3 3 4 4 5 5 6 ...] + pshufb m4, m1, [r4 + 2 * mmsize] ; [4 5 5 6 6 7 7 8 ...] + pshufb m1, m1, [r4 + 3 * mmsize] ; [6 7 7 8 8 9 9 A ...] + pmaddubsw m2, [r6 + 0 * mmsize] + pmaddubsw m3, [r6 + 1 * mmsize] + pmaddubsw m4, [r6 + 2 * mmsize] + pmaddubsw m1, [r6 + 3 * mmsize] + paddw m2, m3 + paddw m1, m4 + paddw m1, m2 + psubw m1, m0 + + vextracti128 xm2, m1, 1 + movu [r2], xm1 ; row 0 + movu [r2 + r3], xm2 ; row 1 + + lea r0, [r0 + r1 * 2] ; first loop src ->5th row(i.e 4) + lea r2, [r2 + r3 * 2] ; first loop dst ->5th row(i.e 4) + sub r5d, 2 + jg .loop + jz .end + + ; last row + movu xm1, [r0] + pshufb xm2, xm1, xm5 ; [0 1 1 2 2 3 3 4 ...] + pshufb xm3, xm1, [r4 + 1 * mmsize] ; [2 3 3 4 4 5 5 6 ...] + pshufb xm4, xm1, [r4 + 2 * mmsize] ; [4 5 5 6 6 7 7 8 ...] + pshufb xm1, xm1, [r4 + 3 * mmsize] ; [6 7 7 8 8 9 9 A ...] 
+ pmaddubsw xm2, [r6 + 0 * mmsize] + pmaddubsw xm3, [r6 + 1 * mmsize] + pmaddubsw xm4, [r6 + 2 * mmsize] + pmaddubsw xm1, [r6 + 3 * mmsize] + paddw xm2, xm3 + paddw xm1, xm4 + paddw xm1, xm2 + psubw xm1, xm0 + movu [r2], xm1 ;row 0 +.end + RET +%endif +%endmacro ; IPFILTER_LUMA_PS_8xN_AVX2 + + IPFILTER_LUMA_PS_8xN_AVX2 4 + IPFILTER_LUMA_PS_8xN_AVX2 8 + IPFILTER_LUMA_PS_8xN_AVX2 16 + IPFILTER_LUMA_PS_8xN_AVX2 32 + + +%macro IPFILTER_LUMA_PS_16x_AVX2 2 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_%1x%2, 6, 10, 7 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + mov r9, %2 ;height + add r3d, r3d + vbroadcasti128 m2, [pw_2000] + + ; register map + ; m0 - interpolate coeff + ; m1 , m6 - shuffle order table + ; m2 - pw_2000 + + xor r7, r7 ; loop count variable + sub r0, 3 + test r5d, r5d + jz .label + lea r8, [r1 * 3] ; r8 = (N / 2 - 1) * srcStride + sub r0, r8 ; r0(src)-r8 + add r9, 7 ; blkheight += N - 1 (7 - 1 = 6 ; since the last one row not in loop) + +.label + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 ; row 0 (col 4 to 7) + pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + phaddw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m6 ;row 1 (col 4 to 7) + pshufb m4, m1 ;row 1 (col 0 to 3) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + phaddw m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + phaddw m3, m4 ; all rows and col completed. 
+
+ mova m5, [interp8_hps_shuf]
+ vpermd m3, m5, m3
+ psubw m3, m2
+
+ movu [r2], m3 ;row 0
+
+ lea r0, [r0 + r1] ; first loop src ->5th row(i.e 4)
+ lea r2, [r2 + r3] ; first loop dst ->5th row(i.e 4)
+ dec r9d
+ jnz .label
+
+ RET
+%endif
+%endmacro
+
+
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 16
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 8
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 12
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 4
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 32
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 64
+
+
+;--------------------------------------------------------------------------------------------------------------
+; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;--------------------------------------------------------------------------------------------------------------
+%macro IPFILTER_LUMA_PP_W8 2
+INIT_XMM sse4
+cglobal interp_8tap_horiz_pp_%1x%2, 4,6,7
+ mov r4d, r4m
+
+%ifdef PIC
+ lea r5, [tab_LumaCoeff]
+ movh m3, [r5 + r4 * 8]
+%else
+ movh m3, [tab_LumaCoeff + r4 * 8]
+%endif
+ pshufd m0, m3, 0 ; m0 = coeff-L
+ pshufd m1, m3, 0x55 ; m1 = coeff-H
+ lea r5, [tab_Tm] ; r5 = shuffle
+ mova m2, [pw_512] ; m2 = 512
+
+ mov r4d, %2
+.loopH:
+%assign x 0
+%rep %1 / 8
+ movu m3, [r0 - 3 + x] ; m3 = [F E D C B A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m3, [r5 + 0*16] ; m4 = [6 5 4 3 5 4 3 2 4 3 2 1 3 2 1 0]
+ pshufb m5, m3, [r5 + 1*16] ; m5 = [A 9 8 7 9 8 7 6 8 7 6 5 7 6 5 4]
+ pshufb m3, [r5 + 2*16] ; m3 = [E D C B D C B A C B A 9 B A 9 8]
+ pmaddubsw m4, m0
+ pmaddubsw m6, m5, m1
+ pmaddubsw m5, m0
+ pmaddubsw m3, m1
+ paddw m4, m6
+ paddw m5, m3
+ phaddw m4, m5
+ pmulhrsw m4, m2
+ packuswb m4, m4
+ movh [r2 + x], m4
+%assign x x+8
+%endrep
+
+ add r0, r1
+ add r2, r3
+
+ dec r4d
+ jnz .loopH
+ RET
+%endmacro
+
+ IPFILTER_LUMA_PP_W8 8, 4
+ IPFILTER_LUMA_PP_W8 8, 8
+ IPFILTER_LUMA_PP_W8 8, 16
+ IPFILTER_LUMA_PP_W8 8, 32
+ IPFILTER_LUMA_PP_W8 16, 4
+ IPFILTER_LUMA_PP_W8 16, 8
+ IPFILTER_LUMA_PP_W8 16, 12
+ IPFILTER_LUMA_PP_W8 16, 16
+ IPFILTER_LUMA_PP_W8 16, 32
+ IPFILTER_LUMA_PP_W8 16, 64
+ IPFILTER_LUMA_PP_W8 24, 32
+ IPFILTER_LUMA_PP_W8 32, 8
+ IPFILTER_LUMA_PP_W8 32, 16
+ IPFILTER_LUMA_PP_W8 32, 24
+ IPFILTER_LUMA_PP_W8 32, 32
+ IPFILTER_LUMA_PP_W8 32, 64
+ IPFILTER_LUMA_PP_W8 48, 64
+ IPFILTER_LUMA_PP_W8 64, 16
+ IPFILTER_LUMA_PP_W8 64, 32
+ IPFILTER_LUMA_PP_W8 64, 48
+ IPFILTER_LUMA_PP_W8 64, 64
+
+;----------------------------------------------------------------------------------------------------------------------------
+; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;----------------------------------------------------------------------------------------------------------------------------
+ IPFILTER_LUMA 4, 4, ps
+ IPFILTER_LUMA 8, 8, ps
+ IPFILTER_LUMA 8, 4, ps
+ IPFILTER_LUMA 4, 8, ps
+ IPFILTER_LUMA 16, 16, ps
+ IPFILTER_LUMA 16, 8, ps
+ IPFILTER_LUMA 8, 16, ps
+ IPFILTER_LUMA 16, 12, ps
+ IPFILTER_LUMA 12, 16, ps
+ IPFILTER_LUMA 16, 4, ps
+ IPFILTER_LUMA 4, 16, ps
+ IPFILTER_LUMA 32, 32, ps
+ IPFILTER_LUMA 32, 16, ps
+ IPFILTER_LUMA 16, 32, ps
+ IPFILTER_LUMA 32, 24, ps
+ IPFILTER_LUMA 24, 32, ps
+ IPFILTER_LUMA 32, 8, ps
+ IPFILTER_LUMA 8, 32, ps
+ IPFILTER_LUMA 64, 64, ps
+ IPFILTER_LUMA 64, 32, ps
+ IPFILTER_LUMA 32, 64, ps
+ IPFILTER_LUMA 64, 48, ps
+ IPFILTER_LUMA 48, 64, ps
+ IPFILTER_LUMA 64, 16, ps
+ IPFILTER_LUMA 16, 64, ps
+
 ;-----------------------------------------------------------------------------
 ; Interpolate HV
 ;-----------------------------------------------------------------------------
@@ -999,6 +5975,6259 @@
 RET
 ;-----------------------------------------------------------------------------
+;void interp_4tap_vert_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal interp_4tap_vert_pp_2x4, 4, 6, 8
+
+ mov r4d, r4m
+ sub r0, r1
+
+%ifdef PIC
+ lea
r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + lea r4, [r1 * 3] + lea r5, [r0 + 4 * r1] + pshufb m0, [tab_Cm] + mova m1, [pw_512] + + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r4] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 + + pmaddubsw m2, m0 + + movd m6, [r5] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 + + pmaddubsw m3, m0 + + phaddw m2, m3 + + pmulhrsw m2, m1 + + movd m7, [r5 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklbw m4, m3 + + pmaddubsw m4, m0 + + movd m3, [r5 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklbw m5, m7 + + pmaddubsw m5, m0 + + phaddw m4, m5 + + pmulhrsw m4, m1 + packuswb m2, m4 + + pextrw [r2], m2, 0 + pextrw [r2 + r3], m2, 2 + lea r2, [r2 + 2 * r3] + pextrw [r2], m2, 4 + pextrw [r2 + r3], m2, 6 + + RET + +%macro FILTER_VER_CHROMA_AVX2_2x4 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_2x4, 4, 6, 2 + mov r4d, r4m + shl r4d, 5 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff_V] + add r5, r4 +%else + lea r5, [tab_ChromaCoeff_V + r4] +%endif + + lea r4, [r1 * 3] + + pinsrw xm1, [r0], 0 + pinsrw xm1, [r0 + r1], 1 + pinsrw xm1, [r0 + r1 * 2], 2 + pinsrw xm1, [r0 + r4], 3 + lea r0, [r0 + r1 * 4] + pinsrw xm1, [r0], 4 + pinsrw xm1, [r0 + r1], 5 + pinsrw xm1, [r0 + r1 * 2], 6 + + pshufb xm0, xm1, [interp_vert_shuf] + pshufb xm1, [interp_vert_shuf + 32] + vinserti128 m0, m0, xm1, 1 + pmaddubsw m0, [r5] + vextracti128 xm1, m0, 1 + paddw xm0, xm1 +%ifidn %1,pp + pmulhrsw xm0, [pw_512] + packuswb xm0, xm0 + lea r4, [r3 * 3] + pextrw [r2], xm0, 0 + pextrw [r2 + r3], xm0, 1 + pextrw [r2 + r3 * 2], xm0, 2 + pextrw [r2 + r4], xm0, 3 +%else + add r3d, r3d + lea r4, [r3 * 3] + psubw xm0, [pw_2000] + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + pextrd [r2 + r3 * 2], xm0, 2 + pextrd [r2 + r4], xm0, 3 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_2x4 pp + FILTER_VER_CHROMA_AVX2_2x4 ps + +%macro 
FILTER_VER_CHROMA_AVX2_2x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_2x8, 4, 6, 2 + mov r4d, r4m + shl r4d, 6 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + + pinsrw xm1, [r0], 0 + pinsrw xm1, [r0 + r1], 1 + pinsrw xm1, [r0 + r1 * 2], 2 + pinsrw xm1, [r0 + r4], 3 + lea r0, [r0 + r1 * 4] + pinsrw xm1, [r0], 4 + pinsrw xm1, [r0 + r1], 5 + pinsrw xm1, [r0 + r1 * 2], 6 + pinsrw xm1, [r0 + r4], 7 + movhlps xm0, xm1 + lea r0, [r0 + r1 * 4] + pinsrw xm0, [r0], 4 + pinsrw xm0, [r0 + r1], 5 + pinsrw xm0, [r0 + r1 * 2], 6 + vinserti128 m1, m1, xm0, 1 + + pshufb m0, m1, [interp_vert_shuf] + pshufb m1, [interp_vert_shuf + 32] + pmaddubsw m0, [r5] + pmaddubsw m1, [r5 + 1 * mmsize] + paddw m0, m1 +%ifidn %1,pp + pmulhrsw m0, [pw_512] + vextracti128 xm1, m0, 1 + packuswb xm0, xm1 + lea r4, [r3 * 3] + pextrw [r2], xm0, 0 + pextrw [r2 + r3], xm0, 1 + pextrw [r2 + r3 * 2], xm0, 2 + pextrw [r2 + r4], xm0, 3 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 4 + pextrw [r2 + r3], xm0, 5 + pextrw [r2 + r3 * 2], xm0, 6 + pextrw [r2 + r4], xm0, 7 +%else + add r3d, r3d + lea r4, [r3 * 3] + psubw m0, [pw_2000] + vextracti128 xm1, m0, 1 + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + pextrd [r2 + r3 * 2], xm0, 2 + pextrd [r2 + r4], xm0, 3 + lea r2, [r2 + r3 * 4] + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + pextrd [r2 + r3 * 2], xm1, 2 + pextrd [r2 + r4], xm1, 3 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_2x8 pp + FILTER_VER_CHROMA_AVX2_2x8 ps + +%macro FILTER_VER_CHROMA_AVX2_2x16 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_2x16, 4, 6, 3 + mov r4d, r4m + shl r4d, 6 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + + movd xm1, [r0] + pinsrw xm1, [r0 + r1], 1 + pinsrw xm1, [r0 + r1 * 2], 2 + pinsrw xm1, [r0 + r4], 3 + lea r0, [r0 + r1 * 4] + pinsrw xm1, [r0], 4 + pinsrw xm1, [r0 + r1], 
5 + pinsrw xm1, [r0 + r1 * 2], 6 + pinsrw xm1, [r0 + r4], 7 + lea r0, [r0 + r1 * 4] + pinsrw xm0, [r0], 4 + pinsrw xm0, [r0 + r1], 5 + pinsrw xm0, [r0 + r1 * 2], 6 + pinsrw xm0, [r0 + r4], 7 + punpckhqdq xm0, xm1, xm0 + vinserti128 m1, m1, xm0, 1 + + pshufb m2, m1, [interp_vert_shuf] + pshufb m1, [interp_vert_shuf + 32] + pmaddubsw m2, [r5] + pmaddubsw m1, [r5 + 1 * mmsize] + paddw m2, m1 + + lea r0, [r0 + r1 * 4] + pinsrw xm1, [r0], 4 + pinsrw xm1, [r0 + r1], 5 + pinsrw xm1, [r0 + r1 * 2], 6 + pinsrw xm1, [r0 + r4], 7 + punpckhqdq xm1, xm0, xm1 + lea r0, [r0 + r1 * 4] + pinsrw xm0, [r0], 4 + pinsrw xm0, [r0 + r1], 5 + pinsrw xm0, [r0 + r1 * 2], 6 + punpckhqdq xm0, xm1, xm0 + vinserti128 m1, m1, xm0, 1 + + pshufb m0, m1, [interp_vert_shuf] + pshufb m1, [interp_vert_shuf + 32] + pmaddubsw m0, [r5] + pmaddubsw m1, [r5 + 1 * mmsize] + paddw m0, m1 +%ifidn %1,pp + mova m1, [pw_512] + pmulhrsw m2, m1 + pmulhrsw m0, m1 + packuswb m2, m0 + lea r4, [r3 * 3] + pextrw [r2], xm2, 0 + pextrw [r2 + r3], xm2, 1 + pextrw [r2 + r3 * 2], xm2, 2 + pextrw [r2 + r4], xm2, 3 + vextracti128 xm0, m2, 1 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 0 + pextrw [r2 + r3], xm0, 1 + pextrw [r2 + r3 * 2], xm0, 2 + pextrw [r2 + r4], xm0, 3 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm2, 4 + pextrw [r2 + r3], xm2, 5 + pextrw [r2 + r3 * 2], xm2, 6 + pextrw [r2 + r4], xm2, 7 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 4 + pextrw [r2 + r3], xm0, 5 + pextrw [r2 + r3 * 2], xm0, 6 + pextrw [r2 + r4], xm0, 7 +%else + add r3d, r3d + lea r4, [r3 * 3] + vbroadcasti128 m1, [pw_2000] + psubw m2, m1 + psubw m0, m1 + vextracti128 xm1, m2, 1 + movd [r2], xm2 + pextrd [r2 + r3], xm2, 1 + pextrd [r2 + r3 * 2], xm2, 2 + pextrd [r2 + r4], xm2, 3 + lea r2, [r2 + r3 * 4] + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + pextrd [r2 + r3 * 2], xm1, 2 + pextrd [r2 + r4], xm1, 3 + vextracti128 xm1, m0, 1 + lea r2, [r2 + r3 * 4] + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + pextrd [r2 + r3 * 2], xm0, 2 + pextrd [r2 + r4], xm0, 3 + 
lea r2, [r2 + r3 * 4] + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + pextrd [r2 + r3 * 2], xm1, 2 + pextrd [r2 + r4], xm1, 3 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_2x16 pp + FILTER_VER_CHROMA_AVX2_2x16 ps + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W2_H4 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_2x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + + mova m1, [pw_512] + + mov r4d, %2 + lea r5, [3 * r1] + +.loop: + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 + + pmaddubsw m2, m0 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 + + pmaddubsw m3, m0 + + phaddw m2, m3 + + pmulhrsw m2, m1 + + movd m7, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklbw m4, m3 + + pmaddubsw m4, m0 + + movd m3, [r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklbw m5, m7 + + pmaddubsw m5, m0 + + phaddw m4, m5 + + pmulhrsw m4, m1 + packuswb m2, m4 + + pextrw [r2], m2, 0 + pextrw [r2 + r3], m2, 2 + lea r2, [r2 + 2 * r3] + pextrw [r2], m2, 4 + pextrw [r2 + r3], m2, 6 + + lea r2, [r2 + 2 * r3] + + sub r4, 4 + jnz .loop + RET +%endmacro + + FILTER_V4_W2_H4 2, 8 + + FILTER_V4_W2_H4 2, 16 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_4x2, 4, 6, 6 + + mov r4d, 
r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + lea r5, [r0 + 2 * r1] + + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r5] + movd m5, [r5 + r1] + + punpcklbw m2, m3 + punpcklbw m1, m4, m5 + punpcklbw m2, m1 + + pmaddubsw m2, m0 + + movd m1, [r0 + 4 * r1] + + punpcklbw m3, m4 + punpcklbw m5, m1 + punpcklbw m3, m5 + + pmaddubsw m3, m0 + + phaddw m2, m3 + + pmulhrsw m2, [pw_512] + packuswb m2, m2 + movd [r2], m2 + pextrd [r2 + r3], m2, 1 + + RET + +%macro FILTER_VER_CHROMA_AVX2_4x2 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x2, 4, 6, 4 + mov r4d, r4m + shl r4d, 5 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff_V] + add r5, r4 +%else + lea r5, [tab_ChromaCoeff_V + r4] +%endif + + lea r4, [r1 * 3] + + movd xm1, [r0] + movd xm2, [r0 + r1] + punpcklbw xm1, xm2 + movd xm3, [r0 + r1 * 2] + punpcklbw xm2, xm3 + movlhps xm1, xm2 + movd xm0, [r0 + r4] + punpcklbw xm3, xm0 + movd xm2, [r0 + r1 * 4] + punpcklbw xm0, xm2 + movlhps xm3, xm0 + vinserti128 m1, m1, xm3, 1 ; m1 = row[x x x 4 3 2 1 0] + + pmaddubsw m1, [r5] + vextracti128 xm3, m1, 1 + paddw xm1, xm3 +%ifidn %1,pp + pmulhrsw xm1, [pw_512] + packuswb xm1, xm1 + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 +%else + add r3d, r3d + psubw xm1, [pw_2000] + movq [r2], xm1 + movhps [r2 + r3], xm1 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_4x2 pp + FILTER_VER_CHROMA_AVX2_4x2 ps + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_4x4, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + mova m1, [pw_512] + lea 
r5, [r0 + 4 * r1] + lea r4, [r1 * 3] + + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r4] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 + + pmaddubsw m2, m0 + + movd m6, [r5] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 + + pmaddubsw m3, m0 + + phaddw m2, m3 + + pmulhrsw m2, m1 + + movd m7, [r5 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklbw m4, m3 + + pmaddubsw m4, m0 + + movd m3, [r5 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklbw m5, m7 + + pmaddubsw m5, m0 + + phaddw m4, m5 + + pmulhrsw m4, m1 + + packuswb m2, m4 + movd [r2], m2 + pextrd [r2 + r3], m2, 1 + lea r2, [r2 + 2 * r3] + pextrd [r2], m2, 2 + pextrd [r2 + r3], m2, 3 + RET +%macro FILTER_VER_CHROMA_AVX2_4x4 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x4, 4, 6, 3 + mov r4d, r4m + shl r4d, 6 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + + movd xm1, [r0] + pinsrd xm1, [r0 + r1], 1 + pinsrd xm1, [r0 + r1 * 2], 2 + pinsrd xm1, [r0 + r4], 3 ; m1 = row[3 2 1 0] + lea r0, [r0 + r1 * 4] + movd xm2, [r0] + pinsrd xm2, [r0 + r1], 1 + pinsrd xm2, [r0 + r1 * 2], 2 ; m2 = row[x 6 5 4] + vinserti128 m1, m1, xm2, 1 ; m1 = row[x 6 5 4 3 2 1 0] + mova m2, [interp4_vpp_shuf1] + vpermd m0, m2, m1 ; m0 = row[4 3 3 2 2 1 1 0] + mova m2, [interp4_vpp_shuf1 + mmsize] + vpermd m1, m2, m1 ; m1 = row[6 5 5 4 4 3 3 2] + + mova m2, [interp4_vpp_shuf] + pshufb m0, m0, m2 + pshufb m1, m1, m2 + pmaddubsw m0, [r5] + pmaddubsw m1, [r5 + mmsize] + paddw m0, m1 ; m0 = WORD ROW[3 2 1 0] +%ifidn %1,pp + pmulhrsw m0, [pw_512] + vextracti128 xm1, m0, 1 + packuswb xm0, xm1 + lea r5, [r3 * 3] + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + pextrd [r2 + r3 * 2], xm0, 2 + pextrd [r2 + r5], xm0, 3 +%else + add r3d, r3d + psubw m0, [pw_2000] + vextracti128 xm1, m0, 1 + lea r5, [r3 * 3] + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm1 
+ movhps [r2 + r5], xm1 +%endif + RET +%endmacro + FILTER_VER_CHROMA_AVX2_4x4 pp + FILTER_VER_CHROMA_AVX2_4x4 ps + +%macro FILTER_VER_CHROMA_AVX2_4x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x8, 4, 6, 5 + mov r4d, r4m + shl r4d, 6 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + + movd xm1, [r0] + pinsrd xm1, [r0 + r1], 1 + pinsrd xm1, [r0 + r1 * 2], 2 + pinsrd xm1, [r0 + r4], 3 ; m1 = row[3 2 1 0] + lea r0, [r0 + r1 * 4] + movd xm2, [r0] + pinsrd xm2, [r0 + r1], 1 + pinsrd xm2, [r0 + r1 * 2], 2 + pinsrd xm2, [r0 + r4], 3 ; m2 = row[7 6 5 4] + vinserti128 m1, m1, xm2, 1 ; m1 = row[7 6 5 4 3 2 1 0] + lea r0, [r0 + r1 * 4] + movd xm3, [r0] + pinsrd xm3, [r0 + r1], 1 + pinsrd xm3, [r0 + r1 * 2], 2 ; m3 = row[x 10 9 8] + vinserti128 m2, m2, xm3, 1 ; m2 = row[x 10 9 8 7 6 5 4] + mova m3, [interp4_vpp_shuf1] + vpermd m0, m3, m1 ; m0 = row[4 3 3 2 2 1 1 0] + vpermd m4, m3, m2 ; m4 = row[8 7 7 6 6 5 5 4] + mova m3, [interp4_vpp_shuf1 + mmsize] + vpermd m1, m3, m1 ; m1 = row[6 5 5 4 4 3 3 2] + vpermd m2, m3, m2 ; m2 = row[10 9 9 8 8 7 7 6] + + mova m3, [interp4_vpp_shuf] + pshufb m0, m0, m3 + pshufb m1, m1, m3 + pshufb m2, m2, m3 + pshufb m4, m4, m3 + pmaddubsw m0, [r5] + pmaddubsw m4, [r5] + pmaddubsw m1, [r5 + mmsize] + pmaddubsw m2, [r5 + mmsize] + paddw m0, m1 ; m0 = WORD ROW[3 2 1 0] + paddw m4, m2 ; m4 = WORD ROW[7 6 5 4] +%ifidn %1,pp + pmulhrsw m0, [pw_512] + pmulhrsw m4, [pw_512] + packuswb m0, m4 + vextracti128 xm1, m0, 1 + lea r5, [r3 * 3] + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + movd [r2 + r3 * 2], xm1 + pextrd [r2 + r5], xm1, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm0, 2 + pextrd [r2 + r3], xm0, 3 + pextrd [r2 + r3 * 2], xm1, 2 + pextrd [r2 + r5], xm1, 3 +%else + add r3d, r3d + psubw m0, [pw_2000] + psubw m4, [pw_2000] + vextracti128 xm1, m0, 1 + vextracti128 xm2, m4, 1 + lea r5, [r3 * 3] + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 
2], xm1 + movhps [r2 + r5], xm1 + lea r2, [r2 + r3 * 4] + movq [r2], xm4 + movhps [r2 + r3], xm4 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r5], xm2 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_4x8 pp + FILTER_VER_CHROMA_AVX2_4x8 ps + +%macro FILTER_VER_CHROMA_AVX2_4xN 2 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x%2, 4, 6, 12 + mov r4d, r4m + shl r4d, 6 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + mova m10, [r5] + mova m11, [r5 + mmsize] +%ifidn %1,pp + mova m9, [pw_512] +%else + add r3d, r3d + mova m9, [pw_2000] +%endif + lea r5, [r3 * 3] +%rep %2 / 16 + movd xm1, [r0] + pinsrd xm1, [r0 + r1], 1 + pinsrd xm1, [r0 + r1 * 2], 2 + pinsrd xm1, [r0 + r4], 3 ; m1 = row[3 2 1 0] + lea r0, [r0 + r1 * 4] + movd xm2, [r0] + pinsrd xm2, [r0 + r1], 1 + pinsrd xm2, [r0 + r1 * 2], 2 + pinsrd xm2, [r0 + r4], 3 ; m2 = row[7 6 5 4] + vinserti128 m1, m1, xm2, 1 ; m1 = row[7 6 5 4 3 2 1 0] + lea r0, [r0 + r1 * 4] + movd xm3, [r0] + pinsrd xm3, [r0 + r1], 1 + pinsrd xm3, [r0 + r1 * 2], 2 + pinsrd xm3, [r0 + r4], 3 ; m3 = row[11 10 9 8] + vinserti128 m2, m2, xm3, 1 ; m2 = row[11 10 9 8 7 6 5 4] + lea r0, [r0 + r1 * 4] + movd xm4, [r0] + pinsrd xm4, [r0 + r1], 1 + pinsrd xm4, [r0 + r1 * 2], 2 + pinsrd xm4, [r0 + r4], 3 ; m4 = row[15 14 13 12] + vinserti128 m3, m3, xm4, 1 ; m3 = row[15 14 13 12 11 10 9 8] + lea r0, [r0 + r1 * 4] + movd xm5, [r0] + pinsrd xm5, [r0 + r1], 1 + pinsrd xm5, [r0 + r1 * 2], 2 ; m5 = row[x 18 17 16] + vinserti128 m4, m4, xm5, 1 ; m4 = row[x 18 17 16 15 14 13 12] + mova m5, [interp4_vpp_shuf1] + vpermd m0, m5, m1 ; m0 = row[4 3 3 2 2 1 1 0] + vpermd m6, m5, m2 ; m6 = row[8 7 7 6 6 5 5 4] + vpermd m7, m5, m3 ; m7 = row[12 11 11 10 10 9 9 8] + vpermd m8, m5, m4 ; m8 = row[16 15 15 14 14 13 13 12] + mova m5, [interp4_vpp_shuf1 + mmsize] + vpermd m1, m5, m1 ; m1 = row[6 5 5 4 4 3 3 2] + vpermd m2, m5, m2 ; m2 = row[10 9 9 8 8 7 7 6] 
+ vpermd m3, m5, m3 ; m3 = row[14 13 13 12 12 11 11 10] + vpermd m4, m5, m4 ; m4 = row[18 17 17 16 16 15 15 14] + + mova m5, [interp4_vpp_shuf] + pshufb m0, m0, m5 + pshufb m1, m1, m5 + pshufb m2, m2, m5 + pshufb m4, m4, m5 + pshufb m3, m3, m5 + pshufb m6, m6, m5 + pshufb m7, m7, m5 + pshufb m8, m8, m5 + pmaddubsw m0, m10 + pmaddubsw m6, m10 + pmaddubsw m7, m10 + pmaddubsw m8, m10 + pmaddubsw m1, m11 + pmaddubsw m2, m11 + pmaddubsw m3, m11 + pmaddubsw m4, m11 + paddw m0, m1 ; m0 = WORD ROW[3 2 1 0] + paddw m6, m2 ; m6 = WORD ROW[7 6 5 4] + paddw m7, m3 ; m7 = WORD ROW[11 10 9 8] + paddw m8, m4 ; m8 = WORD ROW[15 14 13 12] +%ifidn %1,pp + pmulhrsw m0, m9 + pmulhrsw m6, m9 + pmulhrsw m7, m9 + pmulhrsw m8, m9 + packuswb m0, m6 + packuswb m7, m8 + vextracti128 xm1, m0, 1 + vextracti128 xm2, m7, 1 + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + movd [r2 + r3 * 2], xm1 + pextrd [r2 + r5], xm1, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm0, 2 + pextrd [r2 + r3], xm0, 3 + pextrd [r2 + r3 * 2], xm1, 2 + pextrd [r2 + r5], xm1, 3 + lea r2, [r2 + r3 * 4] + movd [r2], xm7 + pextrd [r2 + r3], xm7, 1 + movd [r2 + r3 * 2], xm2 + pextrd [r2 + r5], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm7, 2 + pextrd [r2 + r3], xm7, 3 + pextrd [r2 + r3 * 2], xm2, 2 + pextrd [r2 + r5], xm2, 3 +%else + psubw m0, m9 + psubw m6, m9 + psubw m7, m9 + psubw m8, m9 + vextracti128 xm1, m0, 1 + vextracti128 xm2, m6, 1 + vextracti128 xm3, m7, 1 + vextracti128 xm4, m8, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm1 + movhps [r2 + r5], xm1 + lea r2, [r2 + r3 * 4] + movq [r2], xm6 + movhps [r2 + r3], xm6 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r5], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm7 + movhps [r2 + r3], xm7 + movq [r2 + r3 * 2], xm3 + movhps [r2 + r5], xm3 + lea r2, [r2 + r3 * 4] + movq [r2], xm8 + movhps [r2 + r3], xm8 + movq [r2 + r3 * 2], xm4 + movhps [r2 + r5], xm4 +%endif + lea r2, [r2 + r3 * 4] +%endrep + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_4xN pp, 16 + 
FILTER_VER_CHROMA_AVX2_4xN ps, 16 + FILTER_VER_CHROMA_AVX2_4xN pp, 32 + FILTER_VER_CHROMA_AVX2_4xN ps, 32 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W4_H4 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + + mova m1, [pw_512] + + mov r4d, %2 + + lea r5, [3 * r1] + +.loop: + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 + + pmaddubsw m2, m0 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 + + pmaddubsw m3, m0 + + phaddw m2, m3 + + pmulhrsw m2, m1 + + movd m7, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklbw m4, m3 + + pmaddubsw m4, m0 + + movd m3, [r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklbw m5, m7 + + pmaddubsw m5, m0 + + phaddw m4, m5 + + pmulhrsw m4, m1 + packuswb m2, m4 + movd [r2], m2 + pextrd [r2 + r3], m2, 1 + lea r2, [r2 + 2 * r3] + pextrd [r2], m2, 2 + pextrd [r2 + r3], m2, 3 + + lea r2, [r2 + 2 * r3] + + sub r4, 4 + jnz .loop + RET +%endmacro + + FILTER_V4_W4_H4 4, 8 + FILTER_V4_W4_H4 4, 16 + + FILTER_V4_W4_H4 4, 32 + +%macro FILTER_V4_W8_H2 0 + punpcklbw m1, m2 + punpcklbw m7, m3, m0 + + pmaddubsw m1, m6 + pmaddubsw m7, m5 + + paddw m1, m7 + + pmulhrsw m1, m4 + packuswb m1, m1 +%endmacro + +%macro FILTER_V4_W8_H3 0 + punpcklbw m2, m3 + punpcklbw m7, m0, m1 + + pmaddubsw m2, m6 + pmaddubsw m7, m5 + + paddw m2, m7 + + pmulhrsw m2, m4 + packuswb m2, m2 +%endmacro + +%macro FILTER_V4_W8_H4 0 + punpcklbw m3, m0 + punpcklbw m7, m1, m2 + + 
pmaddubsw m3, m6 + pmaddubsw m7, m5 + + paddw m3, m7 + + pmulhrsw m3, m4 + packuswb m3, m3 +%endmacro + +%macro FILTER_V4_W8_H5 0 + punpcklbw m0, m1 + punpcklbw m7, m2, m3 + + pmaddubsw m0, m6 + pmaddubsw m7, m5 + + paddw m0, m7 + + pmulhrsw m0, m4 + packuswb m0, m0 +%endmacro + +%macro FILTER_V4_W8_8x2 2 + FILTER_V4_W8 %1, %2 + movq m0, [r0 + 4 * r1] + + FILTER_V4_W8_H2 + + movh [r2 + r3], m1 +%endmacro + +%macro FILTER_V4_W8_8x4 2 + FILTER_V4_W8_8x2 %1, %2 +;8x3 + lea r6, [r0 + 4 * r1] + movq m1, [r6 + r1] + + FILTER_V4_W8_H3 + + movh [r2 + 2 * r3], m2 + +;8x4 + movq m2, [r6 + 2 * r1] + + FILTER_V4_W8_H4 + + lea r5, [r2 + 2 * r3] + movh [r5 + r3], m3 +%endmacro + +%macro FILTER_V4_W8_8x6 2 + FILTER_V4_W8_8x4 %1, %2 +;8x5 + lea r6, [r6 + 2 * r1] + movq m3, [r6 + r1] + + FILTER_V4_W8_H5 + + movh [r2 + 4 * r3], m0 + +;8x6 + movq m0, [r0 + 8 * r1] + + FILTER_V4_W8_H2 + + lea r5, [r2 + 4 * r3] + movh [r5 + r3], m1 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W8 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8 + + mov r4d, r4m + + sub r0, r1 + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + lea r5, [r0 + 2 * r1] + movq m3, [r5 + r1] + + punpcklbw m0, m1 + punpcklbw m4, m2, m3 + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + movd m5, [r6 + r4 * 4] +%else + movd m5, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m6, m5, [tab_Vm] + pmaddubsw m0, m6 + + pshufb m5, [tab_Vm + 16] + pmaddubsw m4, m5 + + paddw m0, m4 + + mova m4, [pw_512] + + pmulhrsw m0, m4 + packuswb m0, m0 + movh [r2], m0 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_8x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
+;----------------------------------------------------------------------------- + FILTER_V4_W8_8x2 8, 2 + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- + FILTER_V4_W8_8x4 8, 4 + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_8x6(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- + FILTER_V4_W8_8x6 8, 6 + + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_4x2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_4x2, 4, 6, 6 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + + movd m2, [r0] + movd m3, [r0 + r1] + lea r5, [r0 + 2 * r1] + movd m4, [r5] + movd m5, [r5 + r1] + + punpcklbw m2, m3 + punpcklbw m1, m4, m5 + punpcklbw m2, m1 + + pmaddubsw m2, m0 + + movd m1, [r0 + 4 * r1] + + punpcklbw m3, m4 + punpcklbw m5, m1 + punpcklbw m3, m5 + + pmaddubsw m3, m0 + + phaddw m2, m3 + + psubw m2, [pw_2000] + movh [r2], m2 + movhps [r2 + r3], m2 + + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) 
+;------------------------------------------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_4x4, 4, 6, 7 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + + lea r4, [r1 * 3] + lea r5, [r0 + 4 * r1] + + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r4] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 + + pmaddubsw m2, m0 + + movd m6, [r5] + + punpcklbw m3, m4 + punpcklbw m1, m5, m6 + punpcklbw m3, m1 + + pmaddubsw m3, m0 + + phaddw m2, m3 + + mova m1, [pw_2000] + + psubw m2, m1 + movh [r2], m2 + movhps [r2 + r3], m2 + + movd m2, [r5 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m2 + punpcklbw m4, m3 + + pmaddubsw m4, m0 + + movd m3, [r5 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m2, m3 + punpcklbw m5, m2 + + pmaddubsw m5, m0 + + phaddw m4, m5 + + psubw m4, m1 + lea r2, [r2 + 2 * r3] + movh [r2], m4 + movhps [r2 + r3], m4 + + RET + +;--------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------- +%macro FILTER_V_PS_W4_H4 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + + mova m1, [pw_2000] + + mov r4d, %2/4 + lea r5, [3 * r1] + +.loop: + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 + + pmaddubsw m2, m0 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + 
punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 + + pmaddubsw m3, m0 + + phaddw m2, m3 + + psubw m2, m1 + movh [r2], m2 + movhps [r2 + r3], m2 + + movd m2, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m2 + punpcklbw m4, m3 + + pmaddubsw m4, m0 + + movd m3, [r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m2, m3 + punpcklbw m5, m2 + + pmaddubsw m5, m0 + + phaddw m4, m5 + + psubw m4, m1 + lea r2, [r2 + 2 * r3] + movh [r2], m4 + movhps [r2 + r3], m4 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V_PS_W4_H4 4, 8 + FILTER_V_PS_W4_H4 4, 16 + + FILTER_V_PS_W4_H4 4, 32 + +;-------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_V_PS_W8_H8_H16_H2 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 7 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m5, [r5 + r4 * 4] +%else + movd m5, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m6, m5, [tab_Vm] + pshufb m5, [tab_Vm + 16] + mova m4, [pw_2000] + + mov r4d, %2/2 + lea r5, [3 * r1] + +.loopH: + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + pmaddubsw m0, m6 + pmaddubsw m2, m5 + + paddw m0, m2 + + psubw m0, m4 + movu [r2], m0 + + movq m0, [r0 + 4 * r1] + + punpcklbw m3, m0 + + pmaddubsw m1, m6 + pmaddubsw m3, m5 + + paddw m1, m3 + psubw m1, m4 + + movu [r2 + r3], m1 + + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_V_PS_W8_H8_H16_H2 8, 2 + FILTER_V_PS_W8_H8_H16_H2 8, 4 + FILTER_V_PS_W8_H8_H16_H2 8, 6 + + FILTER_V_PS_W8_H8_H16_H2 8, 12 + FILTER_V_PS_W8_H8_H16_H2 8, 64 + 
+;-------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_V_PS_W8_H8_H16_H32 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m5, [r5 + r4 * 4] +%else + movd m5, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m6, m5, [tab_Vm] + pshufb m5, [tab_Vm + 16] + mova m4, [pw_2000] + + mov r4d, %2/4 + lea r5, [3 * r1] + +.loop: + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + pmaddubsw m0, m6 + pmaddubsw m7, m2, m5 + + paddw m0, m7 + + psubw m0, m4 + movu [r2], m0 + + lea r0, [r0 + 4 * r1] + movq m0, [r0] + + punpcklbw m3, m0 + + pmaddubsw m1, m6 + pmaddubsw m7, m3, m5 + + paddw m1, m7 + + psubw m1, m4 + movu [r2 + r3], m1 + + movq m1, [r0 + r1] + + punpcklbw m0, m1 + + pmaddubsw m2, m6 + pmaddubsw m0, m5 + + paddw m2, m0 + + psubw m2, m4 + lea r2, [r2 + 2 * r3] + movu [r2], m2 + + movq m2, [r0 + 2 * r1] + + punpcklbw m1, m2 + + pmaddubsw m3, m6 + pmaddubsw m1, m5 + + paddw m3, m1 + psubw m3, m4 + + movu [r2 + r3], m3 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V_PS_W8_H8_H16_H32 8, 8 + FILTER_V_PS_W8_H8_H16_H32 8, 16 + FILTER_V_PS_W8_H8_H16_H32 8, 32 + +;------------------------------------------------------------------------------------------------------------ +;void interp_4tap_vert_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------ +%macro FILTER_V_PS_W6 2 +INIT_XMM sse4 +cglobal 
interp_4tap_vert_ps_6x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m5, [r5 + r4 * 4] +%else + movd m5, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m6, m5, [tab_Vm] + pshufb m5, [tab_Vm + 16] + mova m4, [pw_2000] + lea r5, [3 * r1] + mov r4d, %2/4 + +.loop: + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + pmaddubsw m0, m6 + pmaddubsw m7, m2, m5 + + paddw m0, m7 + psubw m0, m4 + + movh [r2], m0 + pshufd m0, m0, 2 + movd [r2 + 8], m0 + + lea r0, [r0 + 4 * r1] + movq m0, [r0] + punpcklbw m3, m0 + + pmaddubsw m1, m6 + pmaddubsw m7, m3, m5 + + paddw m1, m7 + psubw m1, m4 + + movh [r2 + r3], m1 + pshufd m1, m1, 2 + movd [r2 + r3 + 8], m1 + + movq m1, [r0 + r1] + punpcklbw m0, m1 + + pmaddubsw m2, m6 + pmaddubsw m0, m5 + + paddw m2, m0 + psubw m2, m4 + + lea r2,[r2 + 2 * r3] + movh [r2], m2 + pshufd m2, m2, 2 + movd [r2 + 8], m2 + + movq m2,[r0 + 2 * r1] + punpcklbw m1, m2 + + pmaddubsw m3, m6 + pmaddubsw m1, m5 + + paddw m3, m1 + psubw m3, m4 + + movh [r2 + r3], m3 + pshufd m3, m3, 2 + movd [r2 + r3 + 8], m3 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V_PS_W6 6, 8 + FILTER_V_PS_W6 6, 16 + +;--------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_12x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------- +%macro FILTER_V_PS_W12 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_12x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + + mov r4d, %2/2 + +.loop: + movu m2, [r0] + movu m3, [r0 
+ r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m7, [r0 + r1] + + punpcklbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m4, m6 + + punpckhbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m2, m6 + + mova m6, [pw_2000] + + psubw m4, m6 + psubw m2, m6 + + movu [r2], m4 + movh [r2 + 16], m2 + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m4, m1 + pmaddubsw m3, m1 + + movu m2, [r0 + 2 * r1] + + punpcklbw m5, m7, m2 + punpckhbw m7, m2 + + pmaddubsw m5, m0 + pmaddubsw m7, m0 + + paddw m4, m5 + paddw m3, m7 + + psubw m4, m6 + psubw m3, m6 + + movu [r2 + r3], m4 + movh [r2 + r3 + 16], m3 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V_PS_W12 12, 16 + FILTER_V_PS_W12 12, 32 + +;--------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_16x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------- +%macro FILTER_V_PS_W16 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + mov r4d, %2/2 + +.loop: + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m7, [r0 + r1] + + punpcklbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m4, m6 + + punpckhbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m2, m6 + + mova m6, [pw_2000] + + psubw m4, m6 + psubw m2, m6 + + movu [r2], m4 + movu [r2 + 16], m2 + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m4, m1 + pmaddubsw m3, m1 + + movu m5, [r0 + 2 * r1] + + punpcklbw m2, m7, m5 
+ punpckhbw m7, m5 + + pmaddubsw m2, m0 + pmaddubsw m7, m0 + + paddw m4, m2 + paddw m3, m7 + + psubw m4, m6 + psubw m3, m6 + + movu [r2 + r3], m4 + movu [r2 + r3 + 16], m3 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V_PS_W16 16, 4 + FILTER_V_PS_W16 16, 8 + FILTER_V_PS_W16 16, 12 + FILTER_V_PS_W16 16, 16 + FILTER_V_PS_W16 16, 32 + + FILTER_V_PS_W16 16, 24 + FILTER_V_PS_W16 16, 64 + +;-------------------------------------------------------------------------------------------------------------- +;void interp_4tap_vert_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_V4_PS_W24 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_24x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + + mov r4d, %2/2 + +.loop: + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r5, [r0 + 2 * r1] + + movu m5, [r5] + movu m7, [r5 + r1] + + punpcklbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m4, m6 + + punpckhbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m2, m6 + + mova m6, [pw_2000] + + psubw m4, m6 + psubw m2, m6 + + movu [r2], m4 + movu [r2 + 16], m2 + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m4, m1 + pmaddubsw m3, m1 + + movu m2, [r5 + 2 * r1] + + punpcklbw m5, m7, m2 + punpckhbw m7, m2 + + pmaddubsw m5, m0 + pmaddubsw m7, m0 + + paddw m4, m5 + paddw m3, m7 + + psubw m4, m6 + psubw m3, m6 + + movu [r2 + r3], m4 + movu [r2 + r3 + 16], m3 + + movq m2, [r0 + 16] + movq m3, [r0 + r1 + 16] + movq m4, [r5 + 16] + movq m5, [r5 + r1 + 16] + + punpcklbw m2, m3 + punpcklbw m7, m4, m5 + + pmaddubsw m2, m1 + pmaddubsw m7, m0 + + paddw m2, m7 + 
psubw m2, m6 + + movu [r2 + 32], m2 + + movq m2, [r5 + 2 * r1 + 16] + + punpcklbw m3, m4 + punpcklbw m5, m2 + + pmaddubsw m3, m1 + pmaddubsw m5, m0 + + paddw m3, m5 + psubw m3, m6 + + movu [r2 + r3 + 32], m3 + + mov r0, r5 + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V4_PS_W24 24, 32 + + FILTER_V4_PS_W24 24, 64 + +;--------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_32x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------- +%macro FILTER_V_PS_W32 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + + mova m7, [pw_2000] + + mov r4d, %2 + +.loop: + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r5, [r0 + 2 * r1] + movu m3, [r5] + movu m5, [r5 + r1] + + punpcklbw m6, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m6, m0 + pmaddubsw m3, m0 + + paddw m4, m6 + paddw m2, m3 + + psubw m4, m7 + psubw m2, m7 + + movu [r2], m4 + movu [r2 + 16], m2 + + movu m2, [r0 + 16] + movu m3, [r0 + r1 + 16] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + movu m3, [r5 + 16] + movu m5, [r5 + r1 + 16] + + punpcklbw m6, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m6, m0 + pmaddubsw m3, m0 + + paddw m4, m6 + paddw m2, m3 + + psubw m4, m7 + psubw m2, m7 + + movu [r2 + 32], m4 + movu [r2 + 48], m2 + + lea r0, [r0 + r1] + lea r2, [r2 + r3] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V_PS_W32 32, 8 + FILTER_V_PS_W32 32, 16 + FILTER_V_PS_W32 32, 24 + FILTER_V_PS_W32 32, 32 + + FILTER_V_PS_W32 32, 48 + 
FILTER_V_PS_W32 32, 64 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W8_H8_H16_H32 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m5, [r5 + r4 * 4] +%else + movd m5, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m6, m5, [tab_Vm] + pshufb m5, [tab_Vm + 16] + mova m4, [pw_512] + lea r5, [r1 * 3] + + mov r4d, %2 + +.loop: + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + pmaddubsw m0, m6 + pmaddubsw m7, m2, m5 + + paddw m0, m7 + + pmulhrsw m0, m4 + packuswb m0, m0 + movh [r2], m0 + + lea r0, [r0 + 4 * r1] + movq m0, [r0] + + punpcklbw m3, m0 + + pmaddubsw m1, m6 + pmaddubsw m7, m3, m5 + + paddw m1, m7 + + pmulhrsw m1, m4 + packuswb m1, m1 + movh [r2 + r3], m1 + + movq m1, [r0 + r1] + + punpcklbw m0, m1 + + pmaddubsw m2, m6 + pmaddubsw m0, m5 + + paddw m2, m0 + + pmulhrsw m2, m4 + + movq m7, [r0 + 2 * r1] + punpcklbw m1, m7 + + pmaddubsw m3, m6 + pmaddubsw m1, m5 + + paddw m3, m1 + + pmulhrsw m3, m4 + packuswb m2, m3 + + lea r2, [r2 + 2 * r3] + movh [r2], m2 + movhps [r2 + r3], m2 + + lea r2, [r2 + 2 * r3] + + sub r4, 4 + jnz .loop + RET +%endmacro + + FILTER_V4_W8_H8_H16_H32 8, 8 + FILTER_V4_W8_H8_H16_H32 8, 16 + FILTER_V4_W8_H8_H16_H32 8, 32 + + FILTER_V4_W8_H8_H16_H32 8, 12 + FILTER_V4_W8_H8_H16_H32 8, 64 + +%macro PROCESS_CHROMA_AVX2_W8_8R 0 + movq xm1, [r0] ; m1 = row 0 + movq xm2, [r0 + r1] ; m2 = row 1 + punpcklbw xm1, xm2 ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] + movq xm3, [r0 + r1 * 2] ; m3 = row 2 + punpcklbw xm2, xm3 ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] + vinserti128 m5, m1, xm2, 1 ; m5 = [27 17 
26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] + pmaddubsw m5, [r5] + movq xm4, [r0 + r4] ; m4 = row 3 + punpcklbw xm3, xm4 ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] + lea r0, [r0 + r1 * 4] + movq xm1, [r0] ; m1 = row 4 + punpcklbw xm4, xm1 ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] + vinserti128 m2, m3, xm4, 1 ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] + pmaddubsw m0, m2, [r5 + 1 * mmsize] + paddw m5, m0 + pmaddubsw m2, [r5] + movq xm3, [r0 + r1] ; m3 = row 5 + punpcklbw xm1, xm3 ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] + movq xm4, [r0 + r1 * 2] ; m4 = row 6 + punpcklbw xm3, xm4 ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] + vinserti128 m1, m1, xm3, 1 ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] + pmaddubsw m0, m1, [r5 + 1 * mmsize] + paddw m2, m0 + pmaddubsw m1, [r5] + movq xm3, [r0 + r4] ; m3 = row 7 + punpcklbw xm4, xm3 ; m4 = [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60] + lea r0, [r0 + r1 * 4] + movq xm0, [r0] ; m0 = row 8 + punpcklbw xm3, xm0 ; m3 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] + vinserti128 m4, m4, xm3, 1 ; m4 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60] + pmaddubsw m3, m4, [r5 + 1 * mmsize] + paddw m1, m3 + pmaddubsw m4, [r5] + movq xm3, [r0 + r1] ; m3 = row 9 + punpcklbw xm0, xm3 ; m0 = [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80] + movq xm6, [r0 + r1 * 2] ; m6 = row 10 + punpcklbw xm3, xm6 ; m3 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90] + vinserti128 m0, m0, xm3, 1 ; m0 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90] - [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80] + pmaddubsw m0, [r5 + 1 * mmsize] + paddw m4, m0 +%endmacro + +%macro FILTER_VER_CHROMA_AVX2_8x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x8, 4, 6, 7 + 
mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + PROCESS_CHROMA_AVX2_W8_8R +%ifidn %1,pp + lea r4, [r3 * 3] + mova m3, [pw_512] + pmulhrsw m5, m3 ; m5 = word: row 0, row 1 + pmulhrsw m2, m3 ; m2 = word: row 2, row 3 + pmulhrsw m1, m3 ; m1 = word: row 4, row 5 + pmulhrsw m4, m3 ; m4 = word: row 6, row 7 + packuswb m5, m2 + packuswb m1, m4 + vextracti128 xm2, m5, 1 + vextracti128 xm4, m1, 1 + movq [r2], xm5 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm5 + movhps [r2 + r4], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm1 + movq [r2 + r3], xm4 + movhps [r2 + r3 * 2], xm1 + movhps [r2 + r4], xm4 +%else + add r3d, r3d + vbroadcasti128 m3, [pw_2000] + lea r4, [r3 * 3] + psubw m5, m3 ; m5 = word: row 0, row 1 + psubw m2, m3 ; m2 = word: row 2, row 3 + psubw m1, m3 ; m1 = word: row 4, row 5 + psubw m4, m3 ; m4 = word: row 6, row 7 + vextracti128 xm6, m5, 1 + vextracti128 xm3, m2, 1 + vextracti128 xm0, m1, 1 + movu [r2], xm5 + movu [r2 + r3], xm6 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm3 + lea r2, [r2 + r3 * 4] + movu [r2], xm1 + movu [r2 + r3], xm0 + movu [r2 + r3 * 2], xm4 + vextracti128 xm4, m4, 1 + movu [r2 + r4], xm4 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_8x8 pp + FILTER_VER_CHROMA_AVX2_8x8 ps + +%macro FILTER_VER_CHROMA_AVX2_8x6 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x6, 4, 6, 6 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + + movq xm1, [r0] ; m1 = row 0 + movq xm2, [r0 + r1] ; m2 = row 1 + punpcklbw xm1, xm2 ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] + movq xm3, [r0 + r1 * 2] ; m3 = row 2 + punpcklbw xm2, xm3 ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] + vinserti128 m5, m1, xm2, 1 ; m5 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 
04 13 03 12 02 11 01 10 00] + pmaddubsw m5, [r5] + movq xm4, [r0 + r4] ; m4 = row 3 + punpcklbw xm3, xm4 ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] + lea r0, [r0 + r1 * 4] + movq xm1, [r0] ; m1 = row 4 + punpcklbw xm4, xm1 ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] + vinserti128 m2, m3, xm4, 1 ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] + pmaddubsw m0, m2, [r5 + 1 * mmsize] + paddw m5, m0 + pmaddubsw m2, [r5] + movq xm3, [r0 + r1] ; m3 = row 5 + punpcklbw xm1, xm3 ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] + movq xm4, [r0 + r1 * 2] ; m4 = row 6 + punpcklbw xm3, xm4 ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] + vinserti128 m1, m1, xm3, 1 ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] + pmaddubsw m0, m1, [r5 + 1 * mmsize] + paddw m2, m0 + pmaddubsw m1, [r5] + movq xm3, [r0 + r4] ; m3 = row 7 + punpcklbw xm4, xm3 ; m4 = [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60] + lea r0, [r0 + r1 * 4] + movq xm0, [r0] ; m0 = row 8 + punpcklbw xm3, xm0 ; m3 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] + vinserti128 m4, m4, xm3, 1 ; m4 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60] + pmaddubsw m4, [r5 + 1 * mmsize] + paddw m1, m4 +%ifidn %1,pp + lea r4, [r3 * 3] + mova m3, [pw_512] + pmulhrsw m5, m3 ; m5 = word: row 0, row 1 + pmulhrsw m2, m3 ; m2 = word: row 2, row 3 + pmulhrsw m1, m3 ; m1 = word: row 4, row 5 + packuswb m5, m2 + packuswb m1, m1 + vextracti128 xm2, m5, 1 + vextracti128 xm4, m1, 1 + movq [r2], xm5 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm5 + movhps [r2 + r4], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm1 + movq [r2 + r3], xm4 +%else + add r3d, r3d + mova m3, [pw_2000] + lea r4, [r3 * 3] + psubw m5, m3 ; m5 = word: row 0, row 1 + psubw m2, m3 ; m2 = word: row 2, row 3 + psubw m1, m3 ; m1 = word: row 4, row 5 + 
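; Reading aid (editorial sketch, not upstream x265 text): what the "ps"
; branch does differently from "pp".
;   add   r3d, r3d       ; dstStride is presumably counted in 16-bit units,
;                        ; so it is doubled for byte addressing
;   psubw mN, m3         ; subtract the pw_2000 constant from every word
;                        ; (the value 0x2000 is assumed from the name)
;   movu  [dst], xmN     ; 8 words = 16 bytes per row, no packuswb
; Each ymm accumulator carries two output rows (low lane, high lane), which
; is why the odd row is pulled out with vextracti128 ..., 1 before its store.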
vextracti128 xm4, m5, 1 + vextracti128 xm3, m2, 1 + vextracti128 xm0, m1, 1 + movu [r2], xm5 + movu [r2 + r3], xm4 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm3 + lea r2, [r2 + r3 * 4] + movu [r2], xm1 + movu [r2 + r3], xm0 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_8x6 pp + FILTER_VER_CHROMA_AVX2_8x6 ps + +%macro PROCESS_CHROMA_AVX2_W8_16R 1 + movq xm1, [r0] ; m1 = row 0 + movq xm2, [r0 + r1] ; m2 = row 1 + punpcklbw xm1, xm2 + movq xm3, [r0 + r1 * 2] ; m3 = row 2 + punpcklbw xm2, xm3 + vinserti128 m5, m1, xm2, 1 + pmaddubsw m5, [r5] + movq xm4, [r0 + r4] ; m4 = row 3 + punpcklbw xm3, xm4 + lea r0, [r0 + r1 * 4] + movq xm1, [r0] ; m1 = row 4 + punpcklbw xm4, xm1 + vinserti128 m2, m3, xm4, 1 + pmaddubsw m0, m2, [r5 + 1 * mmsize] + paddw m5, m0 + pmaddubsw m2, [r5] + movq xm3, [r0 + r1] ; m3 = row 5 + punpcklbw xm1, xm3 + movq xm4, [r0 + r1 * 2] ; m4 = row 6 + punpcklbw xm3, xm4 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m0, m1, [r5 + 1 * mmsize] + paddw m2, m0 + pmaddubsw m1, [r5] + movq xm3, [r0 + r4] ; m3 = row 7 + punpcklbw xm4, xm3 + lea r0, [r0 + r1 * 4] + movq xm0, [r0] ; m0 = row 8 + punpcklbw xm3, xm0 + vinserti128 m4, m4, xm3, 1 + pmaddubsw m3, m4, [r5 + 1 * mmsize] + paddw m1, m3 + pmaddubsw m4, [r5] + movq xm3, [r0 + r1] ; m3 = row 9 + punpcklbw xm0, xm3 + movq xm6, [r0 + r1 * 2] ; m6 = row 10 + punpcklbw xm3, xm6 + vinserti128 m0, m0, xm3, 1 + pmaddubsw m3, m0, [r5 + 1 * mmsize] + paddw m4, m3 + pmaddubsw m0, [r5] +%ifidn %1,pp + pmulhrsw m5, m7 ; m5 = word: row 0, row 1 + pmulhrsw m2, m7 ; m2 = word: row 2, row 3 + pmulhrsw m1, m7 ; m1 = word: row 4, row 5 + pmulhrsw m4, m7 ; m4 = word: row 6, row 7 + packuswb m5, m2 + packuswb m1, m4 + vextracti128 xm2, m5, 1 + vextracti128 xm4, m1, 1 + movq [r2], xm5 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm1 + movq [r2 + r3], xm4 + movhps [r2 + r3 * 2], xm1 + movhps [r2 + r6], xm4 +%else + psubw m5, m7 ; m5 = word: row 0, row 1 + psubw 
m2, m7 ; m2 = word: row 2, row 3 + psubw m1, m7 ; m1 = word: row 4, row 5 + psubw m4, m7 ; m4 = word: row 6, row 7 + vextracti128 xm3, m5, 1 + movu [r2], xm5 + movu [r2 + r3], xm3 + vextracti128 xm3, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 + lea r2, [r2 + r3 * 4] + vextracti128 xm5, m1, 1 + vextracti128 xm3, m4, 1 + movu [r2], xm1 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm4 + movu [r2 + r6], xm3 +%endif + movq xm3, [r0 + r4] ; m3 = row 11 + punpcklbw xm6, xm3 + lea r0, [r0 + r1 * 4] + movq xm5, [r0] ; m5 = row 12 + punpcklbw xm3, xm5 + vinserti128 m6, m6, xm3, 1 + pmaddubsw m3, m6, [r5 + 1 * mmsize] + paddw m0, m3 + pmaddubsw m6, [r5] + movq xm3, [r0 + r1] ; m3 = row 13 + punpcklbw xm5, xm3 + movq xm2, [r0 + r1 * 2] ; m2 = row 14 + punpcklbw xm3, xm2 + vinserti128 m5, m5, xm3, 1 + pmaddubsw m3, m5, [r5 + 1 * mmsize] + paddw m6, m3 + pmaddubsw m5, [r5] + movq xm3, [r0 + r4] ; m3 = row 15 + punpcklbw xm2, xm3 + lea r0, [r0 + r1 * 4] + movq xm1, [r0] ; m1 = row 16 + punpcklbw xm3, xm1 + vinserti128 m2, m2, xm3, 1 + pmaddubsw m3, m2, [r5 + 1 * mmsize] + paddw m5, m3 + pmaddubsw m2, [r5] + movq xm3, [r0 + r1] ; m3 = row 17 + punpcklbw xm1, xm3 + movq xm4, [r0 + r1 * 2] ; m4 = row 18 + punpcklbw xm3, xm4 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m1, [r5 + 1 * mmsize] + paddw m2, m1 + lea r2, [r2 + r3 * 4] +%ifidn %1,pp + pmulhrsw m0, m7 ; m0 = word: row 8, row 9 + pmulhrsw m6, m7 ; m6 = word: row 10, row 11 + pmulhrsw m5, m7 ; m5 = word: row 12, row 13 + pmulhrsw m2, m7 ; m2 = word: row 14, row 15 + packuswb m0, m6 + packuswb m5, m2 + vextracti128 xm6, m0, 1 + vextracti128 xm2, m5, 1 + movq [r2], xm0 + movq [r2 + r3], xm6 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm6 + lea r2, [r2 + r3 * 4] + movq [r2], xm5 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm2 +%else + psubw m0, m7 ; m0 = word: row 8, row 9 + psubw m6, m7 ; m6 = word: row 10, row 11 + psubw m5, m7 ; m5 = word: row 12, row 13 + psubw m2, m7 ; m2 = word: row 14, 
row 15 + vextracti128 xm1, m0, 1 + vextracti128 xm3, m6, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm3 + lea r2, [r2 + r3 * 4] + vextracti128 xm1, m5, 1 + vextracti128 xm3, m2, 1 + movu [r2], xm5 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 +%endif +%endmacro + +%macro FILTER_VER_CHROMA_AVX2_8x16 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x16, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m7, [pw_512] +%else + add r3d, r3d + mova m7, [pw_2000] +%endif + lea r6, [r3 * 3] + PROCESS_CHROMA_AVX2_W8_16R %1 + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_8x16 pp + FILTER_VER_CHROMA_AVX2_8x16 ps + +%macro FILTER_VER_CHROMA_AVX2_8x12 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x12, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1, pp + mova m7, [pw_512] +%else + add r3d, r3d + mova m7, [pw_2000] +%endif + lea r6, [r3 * 3] + movq xm1, [r0] ; m1 = row 0 + movq xm2, [r0 + r1] ; m2 = row 1 + punpcklbw xm1, xm2 + movq xm3, [r0 + r1 * 2] ; m3 = row 2 + punpcklbw xm2, xm3 + vinserti128 m5, m1, xm2, 1 + pmaddubsw m5, [r5] + movq xm4, [r0 + r4] ; m4 = row 3 + punpcklbw xm3, xm4 + lea r0, [r0 + r1 * 4] + movq xm1, [r0] ; m1 = row 4 + punpcklbw xm4, xm1 + vinserti128 m2, m3, xm4, 1 + pmaddubsw m0, m2, [r5 + 1 * mmsize] + paddw m5, m0 + pmaddubsw m2, [r5] + movq xm3, [r0 + r1] ; m3 = row 5 + punpcklbw xm1, xm3 + movq xm4, [r0 + r1 * 2] ; m4 = row 6 + punpcklbw xm3, xm4 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m0, m1, [r5 + 1 * mmsize] + paddw m2, m0 + pmaddubsw m1, [r5] + movq xm3, [r0 + r4] ; m3 = row 7 + punpcklbw xm4, xm3 + lea r0, [r0 + r1 * 4] + movq xm0, [r0] ; m0 = row 8 + punpcklbw xm3, xm0 + vinserti128 
m4, m4, xm3, 1 + pmaddubsw m3, m4, [r5 + 1 * mmsize] + paddw m1, m3 + pmaddubsw m4, [r5] + movq xm3, [r0 + r1] ; m3 = row 9 + punpcklbw xm0, xm3 + movq xm6, [r0 + r1 * 2] ; m6 = row 10 + punpcklbw xm3, xm6 + vinserti128 m0, m0, xm3, 1 + pmaddubsw m3, m0, [r5 + 1 * mmsize] + paddw m4, m3 + pmaddubsw m0, [r5] +%ifidn %1, pp + pmulhrsw m5, m7 ; m5 = word: row 0, row 1 + pmulhrsw m2, m7 ; m2 = word: row 2, row 3 + pmulhrsw m1, m7 ; m1 = word: row 4, row 5 + pmulhrsw m4, m7 ; m4 = word: row 6, row 7 + packuswb m5, m2 + packuswb m1, m4 + vextracti128 xm2, m5, 1 + vextracti128 xm4, m1, 1 + movq [r2], xm5 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm1 + movq [r2 + r3], xm4 + movhps [r2 + r3 * 2], xm1 + movhps [r2 + r6], xm4 +%else + psubw m5, m7 ; m5 = word: row 0, row 1 + psubw m2, m7 ; m2 = word: row 2, row 3 + psubw m1, m7 ; m1 = word: row 4, row 5 + psubw m4, m7 ; m4 = word: row 6, row 7 + vextracti128 xm3, m5, 1 + movu [r2], xm5 + movu [r2 + r3], xm3 + vextracti128 xm3, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 + lea r2, [r2 + r3 * 4] + vextracti128 xm5, m1, 1 + vextracti128 xm3, m4, 1 + movu [r2], xm1 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm4 + movu [r2 + r6], xm3 +%endif + movq xm3, [r0 + r4] ; m3 = row 11 + punpcklbw xm6, xm3 + lea r0, [r0 + r1 * 4] + movq xm5, [r0] ; m5 = row 12 + punpcklbw xm3, xm5 + vinserti128 m6, m6, xm3, 1 + pmaddubsw m3, m6, [r5 + 1 * mmsize] + paddw m0, m3 + pmaddubsw m6, [r5] + movq xm3, [r0 + r1] ; m3 = row 13 + punpcklbw xm5, xm3 + movq xm2, [r0 + r1 * 2] ; m2 = row 14 + punpcklbw xm3, xm2 + vinserti128 m5, m5, xm3, 1 + pmaddubsw m3, m5, [r5 + 1 * mmsize] + paddw m6, m3 + lea r2, [r2 + r3 * 4] +%ifidn %1, pp + pmulhrsw m0, m7 ; m0 = word: row 8, row 9 + pmulhrsw m6, m7 ; m6 = word: row 10, row 11 + packuswb m0, m6 + vextracti128 xm6, m0, 1 + movq [r2], xm0 + movq [r2 + r3], xm6 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm6 +%else + psubw m0, m7 ; 
m0 = word: row 8, row 9 + psubw m6, m7 ; m6 = word: row 10, row 11 + vextracti128 xm1, m0, 1 + vextracti128 xm3, m6, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm3 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_8x12 pp + FILTER_VER_CHROMA_AVX2_8x12 ps + +%macro FILTER_VER_CHROMA_AVX2_8xN 2 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x%2, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m7, [pw_512] +%else + add r3d, r3d + mova m7, [pw_2000] +%endif + lea r6, [r3 * 3] +%rep %2 / 16 + PROCESS_CHROMA_AVX2_W8_16R %1 + lea r2, [r2 + r3 * 4] +%endrep + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_8xN pp, 32 + FILTER_VER_CHROMA_AVX2_8xN ps, 32 + FILTER_VER_CHROMA_AVX2_8xN pp, 64 + FILTER_VER_CHROMA_AVX2_8xN ps, 64 + +%macro PROCESS_CHROMA_AVX2_W8_4R 0 + movq xm1, [r0] ; m1 = row 0 + movq xm2, [r0 + r1] ; m2 = row 1 + punpcklbw xm1, xm2 ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] + movq xm3, [r0 + r1 * 2] ; m3 = row 2 + punpcklbw xm2, xm3 ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] + vinserti128 m0, m1, xm2, 1 ; m0 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] + pmaddubsw m0, [r5] + movq xm4, [r0 + r4] ; m4 = row 3 + punpcklbw xm3, xm4 ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] + lea r0, [r0 + r1 * 4] + movq xm1, [r0] ; m1 = row 4 + punpcklbw xm4, xm1 ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] + vinserti128 m2, m3, xm4, 1 ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] + pmaddubsw m4, m2, [r5 + 1 * mmsize] + paddw m0, m4 + pmaddubsw m2, [r5] + movq xm3, [r0 + r1] ; m3 = row 5 + punpcklbw xm1, xm3 ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] + movq xm4, [r0 + r1 * 2] ; m4 = row 6 + punpcklbw 
xm3, xm4 ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] + vinserti128 m1, m1, xm3, 1 ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] + pmaddubsw m1, [r5 + 1 * mmsize] + paddw m2, m1 +%endmacro + +%macro FILTER_VER_CHROMA_AVX2_8x4 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x4, 4, 6, 5 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + PROCESS_CHROMA_AVX2_W8_4R +%ifidn %1,pp + lea r4, [r3 * 3] + mova m3, [pw_512] + pmulhrsw m0, m3 ; m0 = word: row 0, row 1 + pmulhrsw m2, m3 ; m2 = word: row 2, row 3 + packuswb m0, m2 + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r4], xm2 +%else + add r3d, r3d + vbroadcasti128 m3, [pw_2000] + lea r4, [r3 * 3] + psubw m0, m3 ; m0 = word: row 0, row 1 + psubw m2, m3 ; m2 = word: row 2, row 3 + vextracti128 xm1, m0, 1 + vextracti128 xm4, m2, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm4 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_8x4 pp + FILTER_VER_CHROMA_AVX2_8x4 ps + +%macro FILTER_VER_CHROMA_AVX2_8x2 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x2, 4, 6, 4 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + + movq xm1, [r0] ; m1 = row 0 + movq xm2, [r0 + r1] ; m2 = row 1 + punpcklbw xm1, xm2 ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] + movq xm3, [r0 + r1 * 2] ; m3 = row 2 + punpcklbw xm2, xm3 ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] + vinserti128 m1, m1, xm2, 1 ; m1 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] + pmaddubsw m1, [r5] + movq xm2, [r0 + r4] ; m2 = row 3 + punpcklbw xm3, xm2 ; m3 = [37 27 36 26 35 
25 34 24 33 23 32 22 31 21 30 20] + movq xm0, [r0 + r1 * 4] ; m0 = row 4 + punpcklbw xm2, xm0 ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] + vinserti128 m3, m3, xm2, 1 ; m3 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] + pmaddubsw m3, [r5 + 1 * mmsize] + paddw m1, m3 +%ifidn %1,pp + pmulhrsw m1, [pw_512] ; m1 = word: row 0, row 1 + packuswb m1, m1 + vextracti128 xm0, m1, 1 + movq [r2], xm1 + movq [r2 + r3], xm0 +%else + add r3d, r3d + psubw m1, [pw_2000] ; m1 = word: row 0, row 1 + vextracti128 xm0, m1, 1 + movu [r2], xm1 + movu [r2 + r3], xm0 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_8x2 pp + FILTER_VER_CHROMA_AVX2_8x2 ps + +%macro FILTER_VER_CHROMA_AVX2_6x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_6x8, 4, 6, 7 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + PROCESS_CHROMA_AVX2_W8_8R +%ifidn %1,pp + lea r4, [r3 * 3] + mova m3, [pw_512] + pmulhrsw m5, m3 ; m5 = word: row 0, row 1 + pmulhrsw m2, m3 ; m2 = word: row 2, row 3 + pmulhrsw m1, m3 ; m1 = word: row 4, row 5 + pmulhrsw m4, m3 ; m4 = word: row 6, row 7 + packuswb m5, m2 + packuswb m1, m4 + vextracti128 xm2, m5, 1 + vextracti128 xm4, m1, 1 + movd [r2], xm5 + pextrw [r2 + 4], xm5, 2 + movd [r2 + r3], xm2 + pextrw [r2 + r3 + 4], xm2, 2 + pextrd [r2 + r3 * 2], xm5, 2 + pextrw [r2 + r3 * 2 + 4], xm5, 6 + pextrd [r2 + r4], xm2, 2 + pextrw [r2 + r4 + 4], xm2, 6 + lea r2, [r2 + r3 * 4] + movd [r2], xm1 + pextrw [r2 + 4], xm1, 2 + movd [r2 + r3], xm4 + pextrw [r2 + r3 + 4], xm4, 2 + pextrd [r2 + r3 * 2], xm1, 2 + pextrw [r2 + r3 * 2 + 4], xm1, 6 + pextrd [r2 + r4], xm4, 2 + pextrw [r2 + r4 + 4], xm4, 6 +%else + add r3d, r3d + vbroadcasti128 m3, [pw_2000] + lea r4, [r3 * 3] + psubw m5, m3 ; m5 = word: row 0, row 1 + psubw m2, m3 ; m2 = word: row 2, row 3 + psubw m1, m3 ; m1 = word: row 4, row 5 + psubw m4, 
m3 ; m4 = word: row 6, row 7 + vextracti128 xm6, m5, 1 + vextracti128 xm3, m2, 1 + vextracti128 xm0, m1, 1 + movq [r2], xm5 + pextrd [r2 + 8], xm5, 2 + movq [r2 + r3], xm6 + pextrd [r2 + r3 + 8], xm6, 2 + movq [r2 + r3 * 2], xm2 + pextrd [r2 + r3 * 2 + 8], xm2, 2 + movq [r2 + r4], xm3 + pextrd [r2 + r4 + 8], xm3, 2 + lea r2, [r2 + r3 * 4] + movq [r2], xm1 + pextrd [r2 + 8], xm1, 2 + movq [r2 + r3], xm0 + pextrd [r2 + r3 + 8], xm0, 2 + movq [r2 + r3 * 2], xm4 + pextrd [r2 + r3 * 2 + 8], xm4, 2 + vextracti128 xm4, m4, 1 + movq [r2 + r4], xm4 + pextrd [r2 + r4 + 8], xm4, 2 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_6x8 pp + FILTER_VER_CHROMA_AVX2_6x8 ps + +;----------------------------------------------------------------------------- +;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W6_H4 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_6x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m5, [r5 + r4 * 4] +%else + movd m5, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m6, m5, [tab_Vm] + pshufb m5, [tab_Vm + 16] + mova m4, [pw_512] + + mov r4d, %2 + lea r5, [3 * r1] + +.loop: + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + pmaddubsw m0, m6 + pmaddubsw m7, m2, m5 + + paddw m0, m7 + + pmulhrsw m0, m4 + packuswb m0, m0 + movd [r2], m0 + pextrw [r2 + 4], m0, 2 + + lea r0, [r0 + 4 * r1] + + movq m0, [r0] + punpcklbw m3, m0 + + pmaddubsw m1, m6 + pmaddubsw m7, m3, m5 + + paddw m1, m7 + + pmulhrsw m1, m4 + packuswb m1, m1 + movd [r2 + r3], m1 + pextrw [r2 + r3 + 4], m1, 2 + + movq m1, [r0 + r1] + punpcklbw m7, m0, m1 + + pmaddubsw m2, m6 + pmaddubsw m7, m5 + + paddw m2, m7 + + pmulhrsw m2, m4 + packuswb m2, m2 + lea r2, [r2 + 2 * r3] + movd [r2], m2 + pextrw [r2 + 4], m2, 2 + + 
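; Reading aid (editorial sketch, not upstream x265 text): rounding and the
; 6-pixel store used in this loop.
;   pmulhrsw x, 512 computes ((x * 512 >> 14) + 1) >> 1, which equals
;   (x + 32) >> 6, assuming pw_512 holds the word 512 in every lane
;   (inferred from its name).
;   packuswb then saturates each word to the range 0..255.
;   movd   [r2], m           ; writes bytes 0..3
;   pextrw [r2 + 4], m, 2    ; writes word 2, i.e. bytes 4..5
; Together the two stores emit exactly six pixels per output row.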
movq m2, [r0 + 2 * r1] + punpcklbw m1, m2 + + pmaddubsw m3, m6 + pmaddubsw m1, m5 + + paddw m3, m1 + + pmulhrsw m3, m4 + packuswb m3, m3 + + movd [r2 + r3], m3 + pextrw [r2 + r3 + 4], m3, 2 + + lea r2, [r2 + 2 * r3] + + sub r4, 4 + jnz .loop + RET +%endmacro + + FILTER_V4_W6_H4 6, 8 + + FILTER_V4_W6_H4 6, 16 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W12_H2 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_12x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + + mov r4d, %2 + +.loop: + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m7, [r0 + r1] + + punpcklbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m4, m6 + + punpckhbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m2, m6 + + mova m6, [pw_512] + + pmulhrsw m4, m6 + pmulhrsw m2, m6 + + packuswb m4, m2 + + movh [r2], m4 + pextrd [r2 + 8], m4, 2 + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m4, m1 + pmaddubsw m3, m1 + + movu m5, [r0 + 2 * r1] + + punpcklbw m2, m7, m5 + punpckhbw m7, m5 + + pmaddubsw m2, m0 + pmaddubsw m7, m0 + + paddw m4, m2 + paddw m3, m7 + + pmulhrsw m4, m6 + pmulhrsw m3, m6 + + packuswb m4, m3 + + movh [r2 + r3], m4 + pextrd [r2 + r3 + 8], m4, 2 + + lea r2, [r2 + 2 * r3] + + sub r4, 2 + jnz .loop + RET +%endmacro + + FILTER_V4_W12_H2 12, 16 + + FILTER_V4_W12_H2 12, 32 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
+;----------------------------------------------------------------------------- +%macro FILTER_V4_W16_H2 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_16x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + + mov r4d, %2/2 + +.loop: + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m6, [r0 + r1] + + punpckhbw m7, m5, m6 + pmaddubsw m7, m0 + paddw m2, m7 + + punpcklbw m7, m5, m6 + pmaddubsw m7, m0 + paddw m4, m7 + + mova m7, [pw_512] + + pmulhrsw m4, m7 + pmulhrsw m2, m7 + + packuswb m4, m2 + + movu [r2], m4 + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m4, m1 + pmaddubsw m3, m1 + + movu m5, [r0 + 2 * r1] + + punpcklbw m2, m6, m5 + punpckhbw m6, m5 + + pmaddubsw m2, m0 + pmaddubsw m6, m0 + + paddw m4, m2 + paddw m3, m6 + + pmulhrsw m4, m7 + pmulhrsw m3, m7 + + packuswb m4, m3 + + movu [r2 + r3], m4 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V4_W16_H2 16, 4 + FILTER_V4_W16_H2 16, 8 + FILTER_V4_W16_H2 16, 12 + FILTER_V4_W16_H2 16, 16 + FILTER_V4_W16_H2 16, 32 + + FILTER_V4_W16_H2 16, 24 + FILTER_V4_W16_H2 16, 64 + +%macro FILTER_VER_CHROMA_AVX2_16x16 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_16x16, 4, 6, 15 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m12, [r5] + mova m13, [r5 + mmsize] + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m14, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m14, [pw_2000] +%endif + lea r5, [r3 * 3] + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m0, m12 + movu xm2, [r0 + 
r1 * 2] ; m2 = row 2 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m1, m12 + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m4, m2, m13 + paddw m0, m4 + pmaddubsw m2, m12 + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m5, m3, m13 + paddw m1, m5 + pmaddubsw m3, m12 + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhbw xm6, xm4, xm5 + punpcklbw xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddubsw m6, m4, m13 + paddw m2, m6 + pmaddubsw m4, m12 + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhbw xm7, xm5, xm6 + punpcklbw xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddubsw m7, m5, m13 + paddw m3, m7 + pmaddubsw m5, m12 + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhbw xm8, xm6, xm7 + punpcklbw xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddubsw m8, m6, m13 + paddw m4, m8 + pmaddubsw m6, m12 + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhbw xm9, xm7, xm8 + punpcklbw xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddubsw m9, m7, m13 + paddw m5, m9 + pmaddubsw m7, m12 + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhbw xm10, xm8, xm9 + punpcklbw xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddubsw m10, m8, m13 + paddw m6, m10 + pmaddubsw m8, m12 + movu xm10, [r0 + r1 * 2] ; m10 = row 10 + punpckhbw xm11, xm9, xm10 + punpcklbw xm9, xm10 + vinserti128 m9, m9, xm11, 1 + pmaddubsw m11, m9, m13 + paddw m7, m11 + pmaddubsw m9, m12 + +%ifidn %1,pp + pmulhrsw m0, m14 ; m0 = word: row 0 + pmulhrsw m1, m14 ; m1 = word: row 1 + pmulhrsw m2, m14 ; m2 = word: row 2 + pmulhrsw m3, m14 ; m3 = word: row 3 + pmulhrsw m4, m14 ; m4 = word: row 4 + pmulhrsw m5, m14 ; m5 = word: row 5 + pmulhrsw m6, m14 ; m6 = word: row 6 + pmulhrsw m7, m14 ; m7 = word: row 7 + packuswb m0, m1 + packuswb m2, m3 + packuswb m4, m5 + packuswb m6, m7 + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + vpermq m4, m4, 
11011000b + vpermq m6, m6, 11011000b + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r5], xm3 + lea r2, [r2 + r3 * 4] + movu [r2], xm4 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm6 + movu [r2 + r5], xm7 +%else + psubw m0, m14 ; m0 = word: row 0 + psubw m1, m14 ; m1 = word: row 1 + psubw m2, m14 ; m2 = word: row 2 + psubw m3, m14 ; m3 = word: row 3 + psubw m4, m14 ; m4 = word: row 4 + psubw m5, m14 ; m5 = word: row 5 + psubw m6, m14 ; m6 = word: row 6 + psubw m7, m14 ; m7 = word: row 7 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r5], m3 + lea r2, [r2 + r3 * 4] + movu [r2], m4 + movu [r2 + r3], m5 + movu [r2 + r3 * 2], m6 + movu [r2 + r5], m7 +%endif + lea r2, [r2 + r3 * 4] + + movu xm11, [r0 + r4] ; m11 = row 11 + punpckhbw xm6, xm10, xm11 + punpcklbw xm10, xm11 + vinserti128 m10, m10, xm6, 1 + pmaddubsw m6, m10, m13 + paddw m8, m6 + pmaddubsw m10, m12 + lea r0, [r0 + r1 * 4] + movu xm6, [r0] ; m6 = row 12 + punpckhbw xm7, xm11, xm6 + punpcklbw xm11, xm6 + vinserti128 m11, m11, xm7, 1 + pmaddubsw m7, m11, m13 + paddw m9, m7 + pmaddubsw m11, m12 + + movu xm7, [r0 + r1] ; m7 = row 13 + punpckhbw xm0, xm6, xm7 + punpcklbw xm6, xm7 + vinserti128 m6, m6, xm0, 1 + pmaddubsw m0, m6, m13 + paddw m10, m0 + pmaddubsw m6, m12 + movu xm0, [r0 + r1 * 2] ; m0 = row 14 + punpckhbw xm1, xm7, xm0 + punpcklbw xm7, xm0 + vinserti128 m7, m7, xm1, 1 + pmaddubsw m1, m7, m13 + paddw m11, m1 + pmaddubsw m7, m12 + movu xm1, [r0 + r4] ; m1 = row 15 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m2, m0, m13 + paddw m6, m2 + pmaddubsw m0, m12 + lea r0, [r0 + r1 * 4] + movu xm2, [r0] ; m2 = row 16 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m3, m1, m13 + paddw m7, m3 + pmaddubsw m1, m12 + movu xm3, [r0 + r1] ; m3 = row 17 + punpckhbw xm4, xm2, xm3 
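; Reading aid (editorial sketch, not upstream x265 text): lane layout in the
; 16-wide AVX2 path.
;   punpcklbw / punpckhbw interleave two adjacent rows; vinserti128 ..., 1
;   keeps pixels 0..7 in lane 0 and places pixels 8..15 in lane 1.
;   packuswb works per 128-bit lane, so packing rows A and B gives the qwords
;     [A 0..7 | B 0..7 | A 8..15 | B 8..15].
;   vpermq ..., 11011000b selects qwords 0, 2, 1, 3, which reorders them to
;     [A 0..7 | A 8..15 | B 0..7 | B 8..15].
; Row A can then be stored from xmN and row B after vextracti128 ..., 1.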
+ punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m2, m13 + paddw m0, m2 + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m3, m13 + paddw m1, m3 + +%ifidn %1,pp + pmulhrsw m8, m14 ; m8 = word: row 8 + pmulhrsw m9, m14 ; m9 = word: row 9 + pmulhrsw m10, m14 ; m10 = word: row 10 + pmulhrsw m11, m14 ; m11 = word: row 11 + pmulhrsw m6, m14 ; m6 = word: row 12 + pmulhrsw m7, m14 ; m7 = word: row 13 + pmulhrsw m0, m14 ; m0 = word: row 14 + pmulhrsw m1, m14 ; m1 = word: row 15 + packuswb m8, m9 + packuswb m10, m11 + packuswb m6, m7 + packuswb m0, m1 + vpermq m8, m8, 11011000b + vpermq m10, m10, 11011000b + vpermq m6, m6, 11011000b + vpermq m0, m0, 11011000b + vextracti128 xm9, m8, 1 + vextracti128 xm11, m10, 1 + vextracti128 xm7, m6, 1 + vextracti128 xm1, m0, 1 + movu [r2], xm8 + movu [r2 + r3], xm9 + movu [r2 + r3 * 2], xm10 + movu [r2 + r5], xm11 + lea r2, [r2 + r3 * 4] + movu [r2], xm6 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm0 + movu [r2 + r5], xm1 +%else + psubw m8, m14 ; m8 = word: row 8 + psubw m9, m14 ; m9 = word: row 9 + psubw m10, m14 ; m10 = word: row 10 + psubw m11, m14 ; m11 = word: row 11 + psubw m6, m14 ; m6 = word: row 12 + psubw m7, m14 ; m7 = word: row 13 + psubw m0, m14 ; m0 = word: row 14 + psubw m1, m14 ; m1 = word: row 15 + movu [r2], m8 + movu [r2 + r3], m9 + movu [r2 + r3 * 2], m10 + movu [r2 + r5], m11 + lea r2, [r2 + r3 * 4] + movu [r2], m6 + movu [r2 + r3], m7 + movu [r2 + r3 * 2], m0 + movu [r2 + r5], m1 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_16x16 pp + FILTER_VER_CHROMA_AVX2_16x16 ps +%macro FILTER_VER_CHROMA_AVX2_16x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_16x8, 4, 7, 7 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m6, [pw_512] +%else + add r3d, r3d + mova m6, [pw_2000] 
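; Reading aid (editorial sketch, not upstream x265 text): register budget.
;   In x86inc's cglobal, the three numbers after the name are the arguments
;   to load, the general-purpose registers used, and the vector registers
;   used.
;   This 16x8 kernel requests 7 vector registers and is assembled
;   unconditionally. The 16x16 and 16x12 kernels request 15 and 10 and are
;   wrapped in %if ARCH_X86_64 == 1, presumably because 32-bit x86 exposes
;   only eight vector registers.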
+%endif + lea r6, [r3 * 3] + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m4, m2, [r5 + mmsize] + paddw m0, m4 + pmaddubsw m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m5, m3, [r5 + mmsize] + paddw m1, m5 + pmaddubsw m3, [r5] +%ifidn %1,pp + pmulhrsw m0, m6 ; m0 = word: row 0 + pmulhrsw m1, m6 ; m1 = word: row 1 + packuswb m0, m1 + vpermq m0, m0, 11011000b + vextracti128 xm1, m0, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 +%else + psubw m0, m6 ; m0 = word: row 0 + psubw m1, m6 ; m1 = word: row 1 + movu [r2], m0 + movu [r2 + r3], m1 +%endif + + movu xm0, [r0 + r1] ; m0 = row 5 + punpckhbw xm1, xm4, xm0 + punpcklbw xm4, xm0 + vinserti128 m4, m4, xm1, 1 + pmaddubsw m1, m4, [r5 + mmsize] + paddw m2, m1 + pmaddubsw m4, [r5] + movu xm1, [r0 + r1 * 2] ; m1 = row 6 + punpckhbw xm5, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm5, 1 + pmaddubsw m5, m0, [r5 + mmsize] + paddw m3, m5 + pmaddubsw m0, [r5] +%ifidn %1,pp + pmulhrsw m2, m6 ; m2 = word: row 2 + pmulhrsw m3, m6 ; m3 = word: row 3 + packuswb m2, m3 + vpermq m2, m2, 11011000b + vextracti128 xm3, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 +%else + psubw m2, m6 ; m2 = word: row 2 + psubw m3, m6 ; m3 = word: row 3 + movu [r2 + r3 * 2], m2 + movu [r2 + r6], m3 +%endif + + movu xm2, [r0 + r4] ; m2 = row 7 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m3, m1, [r5 + mmsize] + paddw m4, m3 + pmaddubsw m1, [r5] + lea r0, [r0 + r1 * 4] + movu xm3, [r0] ; m3 = row 8 + punpckhbw xm5, xm2, xm3 + 
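; Reading aid (editorial sketch, not upstream x265 text): each interleaved
; row pair is multiplied twice.
;   pair(n, n+1) * [r5 + mmsize]   is added to the accumulator of output row n-2
;   pair(n, n+1) * [r5]            starts the accumulator of output row n
; A 4-tap output row r is therefore
;   pair(r, r+1) * [r5]  +  pair(r+2, r+3) * [r5 + mmsize].
; Row numbers are relative to r0 after the initial "sub r0, r1".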
punpcklbw xm2, xm3 + vinserti128 m2, m2, xm5, 1 + pmaddubsw m5, m2, [r5 + mmsize] + paddw m0, m5 + pmaddubsw m2, [r5] + lea r2, [r2 + r3 * 4] +%ifidn %1,pp + pmulhrsw m4, m6 ; m4 = word: row 4 + pmulhrsw m0, m6 ; m0 = word: row 5 + packuswb m4, m0 + vpermq m4, m4, 11011000b + vextracti128 xm0, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm0 +%else + psubw m4, m6 ; m4 = word: row 4 + psubw m0, m6 ; m0 = word: row 5 + movu [r2], m4 + movu [r2 + r3], m0 +%endif + + movu xm5, [r0 + r1] ; m5 = row 9 + punpckhbw xm4, xm3, xm5 + punpcklbw xm3, xm5 + vinserti128 m3, m3, xm4, 1 + pmaddubsw m3, [r5 + mmsize] + paddw m1, m3 + movu xm4, [r0 + r1 * 2] ; m4 = row 10 + punpckhbw xm0, xm5, xm4 + punpcklbw xm5, xm4 + vinserti128 m5, m5, xm0, 1 + pmaddubsw m5, [r5 + mmsize] + paddw m2, m5 +%ifidn %1,pp + pmulhrsw m1, m6 ; m1 = word: row 6 + pmulhrsw m2, m6 ; m2 = word: row 7 + packuswb m1, m2 + vpermq m1, m1, 11011000b + vextracti128 xm2, m1, 1 + movu [r2 + r3 * 2], xm1 + movu [r2 + r6], xm2 +%else + psubw m1, m6 ; m1 = word: row 6 + psubw m2, m6 ; m2 = word: row 7 + movu [r2 + r3 * 2], m1 + movu [r2 + r6], m2 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_16x8 pp + FILTER_VER_CHROMA_AVX2_16x8 ps + +%macro FILTER_VER_CHROMA_AVX2_16x12 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_16x12, 4, 6, 10 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m8, [r5] + mova m9, [r5 + mmsize] + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m7, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m7, [pw_2000] +%endif + lea r5, [r3 * 3] + + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r1 * 2], 1 + movu xm1, [r0 + r1] + vinserti128 m1, m1, [r0 + r4], 1 + + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + vperm2i128 m4, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + pmaddubsw m4, m8 + pmaddubsw m3, m2, m9 + paddw m4, m3 + pmaddubsw m2, m8 + + vextracti128 xm0, m0, 1 + lea r0, [r0 + 
r1 * 4] + vinserti128 m0, m0, [r0], 1 + + punpcklbw m5, m1, m0 + punpckhbw m3, m1, m0 + vperm2i128 m6, m5, m3, 0x20 + vperm2i128 m5, m5, m3, 0x31 + pmaddubsw m6, m8 + pmaddubsw m3, m5, m9 + paddw m6, m3 + pmaddubsw m5, m8 +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 0 + pmulhrsw m6, m7 ; m6 = word: row 1 + packuswb m4, m6 + vpermq m4, m4, 11011000b + vextracti128 xm6, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm6 +%else + psubw m4, m7 ; m4 = word: row 0 + psubw m6, m7 ; m6 = word: row 1 + movu [r2], m4 + movu [r2 + r3], m6 +%endif + + movu xm4, [r0 + r1 * 2] + vinserti128 m4, m4, [r0 + r1], 1 + vextracti128 xm1, m4, 1 + vinserti128 m0, m0, xm1, 0 + + punpcklbw m6, m0, m4 + punpckhbw m1, m0, m4 + vperm2i128 m0, m6, m1, 0x20 + vperm2i128 m6, m6, m1, 0x31 + pmaddubsw m1, m0, m9 + paddw m5, m1 + pmaddubsw m0, m8 + pmaddubsw m1, m6, m9 + paddw m2, m1 + pmaddubsw m6, m8 + +%ifidn %1,pp + pmulhrsw m2, m7 ; m2 = word: row 2 + pmulhrsw m5, m7 ; m5 = word: row 3 + packuswb m2, m5 + vpermq m2, m2, 11011000b + vextracti128 xm5, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r5], xm5 +%else + psubw m2, m7 ; m2 = word: row 2 + psubw m5, m7 ; m5 = word: row 3 + movu [r2 + r3 * 2], m2 + movu [r2 + r5], m5 +%endif + lea r2, [r2 + r3 * 4] + + movu xm1, [r0 + r4] + lea r0, [r0 + r1 * 4] + vinserti128 m1, m1, [r0], 1 + vinserti128 m4, m4, xm1, 1 + + punpcklbw m2, m4, m1 + punpckhbw m5, m4, m1 + vperm2i128 m3, m2, m5, 0x20 + vperm2i128 m2, m2, m5, 0x31 + pmaddubsw m5, m3, m9 + paddw m6, m5 + pmaddubsw m3, m8 + pmaddubsw m5, m2, m9 + paddw m0, m5 + pmaddubsw m2, m8 + +%ifidn %1,pp + pmulhrsw m6, m7 ; m6 = word: row 4 + pmulhrsw m0, m7 ; m0 = word: row 5 + packuswb m6, m0 + vpermq m6, m6, 11011000b + vextracti128 xm0, m6, 1 + movu [r2], xm6 + movu [r2 + r3], xm0 +%else + psubw m6, m7 ; m6 = word: row 4 + psubw m0, m7 ; m0 = word: row 5 + movu [r2], m6 + movu [r2 + r3], m0 +%endif + + movu xm6, [r0 + r1 * 2] + vinserti128 m6, m6, [r0 + r1], 1 + vextracti128 xm0, m6, 1 + vinserti128 m1, 
m1, xm0, 0 + + punpcklbw m4, m1, m6 + punpckhbw m5, m1, m6 + vperm2i128 m0, m4, m5, 0x20 + vperm2i128 m5, m4, m5, 0x31 + pmaddubsw m4, m0, m9 + paddw m2, m4 + pmaddubsw m0, m8 + pmaddubsw m4, m5, m9 + paddw m3, m4 + pmaddubsw m5, m8 + +%ifidn %1,pp + pmulhrsw m3, m7 ; m3 = word: row 6 + pmulhrsw m2, m7 ; m2 = word: row 7 + packuswb m3, m2 + vpermq m3, m3, 11011000b + vextracti128 xm2, m3, 1 + movu [r2 + r3 * 2], xm3 + movu [r2 + r5], xm2 +%else + psubw m3, m7 ; m3 = word: row 6 + psubw m2, m7 ; m2 = word: row 7 + movu [r2 + r3 * 2], m3 + movu [r2 + r5], m2 +%endif + lea r2, [r2 + r3 * 4] + + movu xm3, [r0 + r4] + lea r0, [r0 + r1 * 4] + vinserti128 m3, m3, [r0], 1 + vinserti128 m6, m6, xm3, 1 + + punpcklbw m2, m6, m3 + punpckhbw m1, m6, m3 + vperm2i128 m4, m2, m1, 0x20 + vperm2i128 m2, m2, m1, 0x31 + pmaddubsw m1, m4, m9 + paddw m5, m1 + pmaddubsw m4, m8 + pmaddubsw m1, m2, m9 + paddw m0, m1 + pmaddubsw m2, m8 + +%ifidn %1,pp + pmulhrsw m5, m7 ; m5 = word: row 8 + pmulhrsw m0, m7 ; m0 = word: row 9 + packuswb m5, m0 + vpermq m5, m5, 11011000b + vextracti128 xm0, m5, 1 + movu [r2], xm5 + movu [r2 + r3], xm0 +%else + psubw m5, m7 ; m5 = word: row 8 + psubw m0, m7 ; m0 = word: row 9 + movu [r2], m5 + movu [r2 + r3], m0 +%endif + + movu xm5, [r0 + r1 * 2] + vinserti128 m5, m5, [r0 + r1], 1 + vextracti128 xm0, m5, 1 + vinserti128 m3, m3, xm0, 0 + + punpcklbw m1, m3, m5 + punpckhbw m0, m3, m5 + vperm2i128 m6, m1, m0, 0x20 + vperm2i128 m0, m1, m0, 0x31 + pmaddubsw m1, m6, m9 + paddw m2, m1 + pmaddubsw m1, m0, m9 + paddw m4, m1 + +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 10 + pmulhrsw m2, m7 ; m2 = word: row 11 + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm2, m4, 1 + movu [r2 + r3 * 2], xm4 + movu [r2 + r5], xm2 +%else + psubw m4, m7 ; m4 = word: row 10 + psubw m2, m7 ; m2 = word: row 11 + movu [r2 + r3 * 2], m4 + movu [r2 + r5], m2 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_16x12 pp + FILTER_VER_CHROMA_AVX2_16x12 ps + +%macro 
FILTER_VER_CHROMA_AVX2_16xN 2 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_16x%2, 4, 8, 8 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m7, [pw_512] +%else + add r3d, r3d + mova m7, [pw_2000] +%endif + lea r6, [r3 * 3] + mov r7d, %2 / 16 +.loopH: + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r1 * 2], 1 + movu xm1, [r0 + r1] + vinserti128 m1, m1, [r0 + r4], 1 + + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + vperm2i128 m4, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + pmaddubsw m4, [r5] + pmaddubsw m3, m2, [r5 + mmsize] + paddw m4, m3 + pmaddubsw m2, [r5] + + vextracti128 xm0, m0, 1 + lea r0, [r0 + r1 * 4] + vinserti128 m0, m0, [r0], 1 + + punpcklbw m5, m1, m0 + punpckhbw m3, m1, m0 + vperm2i128 m6, m5, m3, 0x20 + vperm2i128 m5, m5, m3, 0x31 + pmaddubsw m6, [r5] + pmaddubsw m3, m5, [r5 + mmsize] + paddw m6, m3 + pmaddubsw m5, [r5] +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 0 + pmulhrsw m6, m7 ; m6 = word: row 1 + packuswb m4, m6 + vpermq m4, m4, 11011000b + vextracti128 xm6, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm6 +%else + psubw m4, m7 ; m4 = word: row 0 + psubw m6, m7 ; m6 = word: row 1 + movu [r2], m4 + movu [r2 + r3], m6 +%endif + + movu xm4, [r0 + r1 * 2] + vinserti128 m4, m4, [r0 + r1], 1 + vextracti128 xm1, m4, 1 + vinserti128 m0, m0, xm1, 0 + + punpcklbw m6, m0, m4 + punpckhbw m1, m0, m4 + vperm2i128 m0, m6, m1, 0x20 + vperm2i128 m6, m6, m1, 0x31 + pmaddubsw m1, m0, [r5 + mmsize] + paddw m5, m1 + pmaddubsw m0, [r5] + pmaddubsw m1, m6, [r5 + mmsize] + paddw m2, m1 + pmaddubsw m6, [r5] + +%ifidn %1,pp + pmulhrsw m2, m7 ; m2 = word: row 2 + pmulhrsw m5, m7 ; m5 = word: row 3 + packuswb m2, m5 + vpermq m2, m2, 11011000b + vextracti128 xm5, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm5 +%else + psubw m2, m7 ; m2 = word: row 2 + psubw m5, m7 ; m5 = word: row 3 + movu [r2 + r3 * 
2], m2 + movu [r2 + r6], m5 +%endif + lea r2, [r2 + r3 * 4] + + movu xm1, [r0 + r4] + lea r0, [r0 + r1 * 4] + vinserti128 m1, m1, [r0], 1 + vinserti128 m4, m4, xm1, 1 + + punpcklbw m2, m4, m1 + punpckhbw m5, m4, m1 + vperm2i128 m3, m2, m5, 0x20 + vperm2i128 m2, m2, m5, 0x31 + pmaddubsw m5, m3, [r5 + mmsize] + paddw m6, m5 + pmaddubsw m3, [r5] + pmaddubsw m5, m2, [r5 + mmsize] + paddw m0, m5 + pmaddubsw m2, [r5] + +%ifidn %1,pp + pmulhrsw m6, m7 ; m6 = word: row 4 + pmulhrsw m0, m7 ; m0 = word: row 5 + packuswb m6, m0 + vpermq m6, m6, 11011000b + vextracti128 xm0, m6, 1 + movu [r2], xm6 + movu [r2 + r3], xm0 +%else + psubw m6, m7 ; m6 = word: row 4 + psubw m0, m7 ; m0 = word: row 5 + movu [r2], m6 + movu [r2 + r3], m0 +%endif + + movu xm6, [r0 + r1 * 2] + vinserti128 m6, m6, [r0 + r1], 1 + vextracti128 xm0, m6, 1 + vinserti128 m1, m1, xm0, 0 + + punpcklbw m4, m1, m6 + punpckhbw m5, m1, m6 + vperm2i128 m0, m4, m5, 0x20 + vperm2i128 m5, m4, m5, 0x31 + pmaddubsw m4, m0, [r5 + mmsize] + paddw m2, m4 + pmaddubsw m0, [r5] + pmaddubsw m4, m5, [r5 + mmsize] + paddw m3, m4 + pmaddubsw m5, [r5] + +%ifidn %1,pp + pmulhrsw m3, m7 ; m3 = word: row 6 + pmulhrsw m2, m7 ; m2 = word: row 7 + packuswb m3, m2 + vpermq m3, m3, 11011000b + vextracti128 xm2, m3, 1 + movu [r2 + r3 * 2], xm3 + movu [r2 + r6], xm2 +%else + psubw m3, m7 ; m3 = word: row 6 + psubw m2, m7 ; m2 = word: row 7 + movu [r2 + r3 * 2], m3 + movu [r2 + r6], m2 +%endif + lea r2, [r2 + r3 * 4] + + movu xm3, [r0 + r4] + lea r0, [r0 + r1 * 4] + vinserti128 m3, m3, [r0], 1 + vinserti128 m6, m6, xm3, 1 + + punpcklbw m2, m6, m3 + punpckhbw m1, m6, m3 + vperm2i128 m4, m2, m1, 0x20 + vperm2i128 m2, m2, m1, 0x31 + pmaddubsw m1, m4, [r5 + mmsize] + paddw m5, m1 + pmaddubsw m4, [r5] + pmaddubsw m1, m2, [r5 + mmsize] + paddw m0, m1 + pmaddubsw m2, [r5] + +%ifidn %1,pp + pmulhrsw m5, m7 ; m5 = word: row 8 + pmulhrsw m0, m7 ; m0 = word: row 9 + packuswb m5, m0 + vpermq m5, m5, 11011000b + vextracti128 xm0, m5, 1 + movu [r2], xm5 + 
movu [r2 + r3], xm0 +%else + psubw m5, m7 ; m5 = word: row 8 + psubw m0, m7 ; m0 = word: row 9 + movu [r2], m5 + movu [r2 + r3], m0 +%endif + + movu xm5, [r0 + r1 * 2] + vinserti128 m5, m5, [r0 + r1], 1 + vextracti128 xm0, m5, 1 + vinserti128 m3, m3, xm0, 0 + + punpcklbw m1, m3, m5 + punpckhbw m0, m3, m5 + vperm2i128 m6, m1, m0, 0x20 + vperm2i128 m0, m1, m0, 0x31 + pmaddubsw m1, m6, [r5 + mmsize] + paddw m2, m1 + pmaddubsw m6, [r5] + pmaddubsw m1, m0, [r5 + mmsize] + paddw m4, m1 + pmaddubsw m0, [r5] + +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 10 + pmulhrsw m2, m7 ; m2 = word: row 11 + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm2, m4, 1 + movu [r2 + r3 * 2], xm4 + movu [r2 + r6], xm2 +%else + psubw m4, m7 ; m4 = word: row 10 + psubw m2, m7 ; m2 = word: row 11 + movu [r2 + r3 * 2], m4 + movu [r2 + r6], m2 +%endif + lea r2, [r2 + r3 * 4] + + movu xm3, [r0 + r4] + lea r0, [r0 + r1 * 4] + vinserti128 m3, m3, [r0], 1 + vinserti128 m5, m5, xm3, 1 + + punpcklbw m2, m5, m3 + punpckhbw m1, m5, m3 + vperm2i128 m4, m2, m1, 0x20 + vperm2i128 m2, m2, m1, 0x31 + pmaddubsw m1, m4, [r5 + mmsize] + paddw m0, m1 + pmaddubsw m4, [r5] + pmaddubsw m1, m2, [r5 + mmsize] + paddw m6, m1 + pmaddubsw m2, [r5] + +%ifidn %1,pp + pmulhrsw m0, m7 ; m0 = word: row 12 + pmulhrsw m6, m7 ; m6 = word: row 13 + packuswb m0, m6 + vpermq m0, m0, 11011000b + vextracti128 xm6, m0, 1 + movu [r2], xm0 + movu [r2 + r3], xm6 +%else + psubw m0, m7 ; m0 = word: row 12 + psubw m6, m7 ; m6 = word: row 13 + movu [r2], m0 + movu [r2 + r3], m6 +%endif + + movu xm5, [r0 + r1 * 2] + vinserti128 m5, m5, [r0 + r1], 1 + vextracti128 xm0, m5, 1 + vinserti128 m3, m3, xm0, 0 + + punpcklbw m1, m3, m5 + punpckhbw m0, m3, m5 + vperm2i128 m6, m1, m0, 0x20 + vperm2i128 m0, m1, m0, 0x31 + pmaddubsw m6, [r5 + mmsize] + paddw m2, m6 + pmaddubsw m0, [r5 + mmsize] + paddw m4, m0 + +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 14 + pmulhrsw m2, m7 ; m2 = word: row 15 + packuswb m4, m2 + vpermq m4, m4, 
11011000b + vextracti128 xm2, m4, 1 + movu [r2 + r3 * 2], xm4 + movu [r2 + r6], xm2 +%else + psubw m4, m7 ; m4 = word: row 14 + psubw m2, m7 ; m2 = word: row 15 + movu [r2 + r3 * 2], m4 + movu [r2 + r6], m2 +%endif + lea r2, [r2 + r3 * 4] + dec r7d + jnz .loopH + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_16xN pp, 32 + FILTER_VER_CHROMA_AVX2_16xN ps, 32 + FILTER_VER_CHROMA_AVX2_16xN pp, 64 + FILTER_VER_CHROMA_AVX2_16xN ps, 64 + +%macro FILTER_VER_CHROMA_AVX2_16x24 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_16x24, 4, 6, 15 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m12, [r5] + mova m13, [r5 + mmsize] + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m14, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m14, [pw_2000] +%endif + lea r5, [r3 * 3] + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m0, m12 + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m1, m12 + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m4, m2, m13 + paddw m0, m4 + pmaddubsw m2, m12 + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m5, m3, m13 + paddw m1, m5 + pmaddubsw m3, m12 + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhbw xm6, xm4, xm5 + punpcklbw xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddubsw m6, m4, m13 + paddw m2, m6 + pmaddubsw m4, m12 + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhbw xm7, xm5, xm6 + punpcklbw xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddubsw m7, m5, m13 + paddw m3, m7 + pmaddubsw m5, m12 + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhbw xm8, xm6, xm7 + punpcklbw xm6, xm7 + vinserti128 m6, 
m6, xm8, 1 + pmaddubsw m8, m6, m13 + paddw m4, m8 + pmaddubsw m6, m12 + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhbw xm9, xm7, xm8 + punpcklbw xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddubsw m9, m7, m13 + paddw m5, m9 + pmaddubsw m7, m12 + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhbw xm10, xm8, xm9 + punpcklbw xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddubsw m10, m8, m13 + paddw m6, m10 + pmaddubsw m8, m12 + movu xm10, [r0 + r1 * 2] ; m10 = row 10 + punpckhbw xm11, xm9, xm10 + punpcklbw xm9, xm10 + vinserti128 m9, m9, xm11, 1 + pmaddubsw m11, m9, m13 + paddw m7, m11 + pmaddubsw m9, m12 + +%ifidn %1,pp + pmulhrsw m0, m14 ; m0 = word: row 0 + pmulhrsw m1, m14 ; m1 = word: row 1 + pmulhrsw m2, m14 ; m2 = word: row 2 + pmulhrsw m3, m14 ; m3 = word: row 3 + pmulhrsw m4, m14 ; m4 = word: row 4 + pmulhrsw m5, m14 ; m5 = word: row 5 + pmulhrsw m6, m14 ; m6 = word: row 6 + pmulhrsw m7, m14 ; m7 = word: row 7 + packuswb m0, m1 + packuswb m2, m3 + packuswb m4, m5 + packuswb m6, m7 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + vpermq m4, m4, q3120 + vpermq m6, m6, q3120 + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r5], xm3 + lea r2, [r2 + r3 * 4] + movu [r2], xm4 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm6 + movu [r2 + r5], xm7 +%else + psubw m0, m14 ; m0 = word: row 0 + psubw m1, m14 ; m1 = word: row 1 + psubw m2, m14 ; m2 = word: row 2 + psubw m3, m14 ; m3 = word: row 3 + psubw m4, m14 ; m4 = word: row 4 + psubw m5, m14 ; m5 = word: row 5 + psubw m6, m14 ; m6 = word: row 6 + psubw m7, m14 ; m7 = word: row 7 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r5], m3 + lea r2, [r2 + r3 * 4] + movu [r2], m4 + movu [r2 + r3], m5 + movu [r2 + r3 * 2], m6 + movu [r2 + r5], m7 +%endif + lea r2, [r2 + r3 * 4] + + movu xm11, [r0 + r4] ; m11 = row 11 + punpckhbw xm6, xm10, xm11 + punpcklbw xm10, 
xm11 + vinserti128 m10, m10, xm6, 1 + pmaddubsw m6, m10, m13 + paddw m8, m6 + pmaddubsw m10, m12 + lea r0, [r0 + r1 * 4] + movu xm6, [r0] ; m6 = row 12 + punpckhbw xm7, xm11, xm6 + punpcklbw xm11, xm6 + vinserti128 m11, m11, xm7, 1 + pmaddubsw m7, m11, m13 + paddw m9, m7 + pmaddubsw m11, m12 + + movu xm7, [r0 + r1] ; m7 = row 13 + punpckhbw xm0, xm6, xm7 + punpcklbw xm6, xm7 + vinserti128 m6, m6, xm0, 1 + pmaddubsw m0, m6, m13 + paddw m10, m0 + pmaddubsw m6, m12 + movu xm0, [r0 + r1 * 2] ; m0 = row 14 + punpckhbw xm1, xm7, xm0 + punpcklbw xm7, xm0 + vinserti128 m7, m7, xm1, 1 + pmaddubsw m1, m7, m13 + paddw m11, m1 + pmaddubsw m7, m12 + movu xm1, [r0 + r4] ; m1 = row 15 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m2, m0, m13 + paddw m6, m2 + pmaddubsw m0, m12 + lea r0, [r0 + r1 * 4] + movu xm2, [r0] ; m2 = row 16 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m3, m1, m13 + paddw m7, m3 + pmaddubsw m1, m12 + movu xm3, [r0 + r1] ; m3 = row 17 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m4, m2, m13 + paddw m0, m4 + pmaddubsw m2, m12 + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m5, m3, m13 + paddw m1, m5 + pmaddubsw m3, m12 + +%ifidn %1,pp + pmulhrsw m8, m14 ; m8 = word: row 8 + pmulhrsw m9, m14 ; m9 = word: row 9 + pmulhrsw m10, m14 ; m10 = word: row 10 + pmulhrsw m11, m14 ; m11 = word: row 11 + pmulhrsw m6, m14 ; m6 = word: row 12 + pmulhrsw m7, m14 ; m7 = word: row 13 + pmulhrsw m0, m14 ; m0 = word: row 14 + pmulhrsw m1, m14 ; m1 = word: row 15 + packuswb m8, m9 + packuswb m10, m11 + packuswb m6, m7 + packuswb m0, m1 + vpermq m8, m8, q3120 + vpermq m10, m10, q3120 + vpermq m6, m6, q3120 + vpermq m0, m0, q3120 + vextracti128 xm9, m8, 1 + vextracti128 xm11, m10, 1 + vextracti128 xm7, m6, 1 + vextracti128 xm1, m0, 1 + movu [r2], xm8 + movu [r2 + r3], xm9 + movu 
[r2 + r3 * 2], xm10 + movu [r2 + r5], xm11 + lea r2, [r2 + r3 * 4] + movu [r2], xm6 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm0 + movu [r2 + r5], xm1 +%else + psubw m8, m14 ; m8 = word: row 8 + psubw m9, m14 ; m9 = word: row 9 + psubw m10, m14 ; m10 = word: row 10 + psubw m11, m14 ; m11 = word: row 11 + psubw m6, m14 ; m6 = word: row 12 + psubw m7, m14 ; m7 = word: row 13 + psubw m0, m14 ; m0 = word: row 14 + psubw m1, m14 ; m1 = word: row 15 + movu [r2], m8 + movu [r2 + r3], m9 + movu [r2 + r3 * 2], m10 + movu [r2 + r5], m11 + lea r2, [r2 + r3 * 4] + movu [r2], m6 + movu [r2 + r3], m7 + movu [r2 + r3 * 2], m0 + movu [r2 + r5], m1 +%endif + lea r2, [r2 + r3 * 4] + + movu xm5, [r0 + r4] ; m5 = row 19 + punpckhbw xm6, xm4, xm5 + punpcklbw xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddubsw m6, m4, m13 + paddw m2, m6 + pmaddubsw m4, m12 + lea r0, [r0 + r1 * 4] + movu xm6, [r0] ; m6 = row 20 + punpckhbw xm7, xm5, xm6 + punpcklbw xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddubsw m7, m5, m13 + paddw m3, m7 + pmaddubsw m5, m12 + movu xm7, [r0 + r1] ; m7 = row 21 + punpckhbw xm0, xm6, xm7 + punpcklbw xm6, xm7 + vinserti128 m6, m6, xm0, 1 + pmaddubsw m0, m6, m13 + paddw m4, m0 + pmaddubsw m6, m12 + movu xm0, [r0 + r1 * 2] ; m0 = row 22 + punpckhbw xm1, xm7, xm0 + punpcklbw xm7, xm0 + vinserti128 m7, m7, xm1, 1 + pmaddubsw m1, m7, m13 + paddw m5, m1 + pmaddubsw m7, m12 + movu xm1, [r0 + r4] ; m1 = row 23 + punpckhbw xm8, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm8, 1 + pmaddubsw m8, m0, m13 + paddw m6, m8 + pmaddubsw m0, m12 + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 24 + punpckhbw xm9, xm1, xm8 + punpcklbw xm1, xm8 + vinserti128 m1, m1, xm9, 1 + pmaddubsw m9, m1, m13 + paddw m7, m9 + pmaddubsw m1, m12 + movu xm9, [r0 + r1] ; m9 = row 25 + punpckhbw xm10, xm8, xm9 + punpcklbw xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddubsw m8, m13 + paddw m0, m8 + movu xm10, [r0 + r1 * 2] ; m10 = row 26 + punpckhbw xm11, xm9, xm10 + punpcklbw xm9, xm10 + vinserti128 
m9, m9, xm11, 1 + pmaddubsw m9, m13 + paddw m1, m9 + +%ifidn %1,pp + pmulhrsw m2, m14 ; m2 = word: row 16 + pmulhrsw m3, m14 ; m3 = word: row 17 + pmulhrsw m4, m14 ; m4 = word: row 18 + pmulhrsw m5, m14 ; m5 = word: row 19 + pmulhrsw m6, m14 ; m6 = word: row 20 + pmulhrsw m7, m14 ; m7 = word: row 21 + pmulhrsw m0, m14 ; m0 = word: row 22 + pmulhrsw m1, m14 ; m1 = word: row 23 + packuswb m2, m3 + packuswb m4, m5 + packuswb m6, m7 + packuswb m0, m1 + vpermq m2, m2, q3120 + vpermq m4, m4, q3120 + vpermq m6, m6, q3120 + vpermq m0, m0, q3120 + vextracti128 xm3, m2, 1 + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + vextracti128 xm1, m0, 1 + movu [r2], xm2 + movu [r2 + r3], xm3 + movu [r2 + r3 * 2], xm4 + movu [r2 + r5], xm5 + lea r2, [r2 + r3 * 4] + movu [r2], xm6 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm0 + movu [r2 + r5], xm1 +%else + psubw m2, m14 ; m2 = word: row 16 + psubw m3, m14 ; m3 = word: row 17 + psubw m4, m14 ; m4 = word: row 18 + psubw m5, m14 ; m5 = word: row 19 + psubw m6, m14 ; m6 = word: row 20 + psubw m7, m14 ; m7 = word: row 21 + psubw m0, m14 ; m0 = word: row 22 + psubw m1, m14 ; m1 = word: row 23 + movu [r2], m2 + movu [r2 + r3], m3 + movu [r2 + r3 * 2], m4 + movu [r2 + r5], m5 + lea r2, [r2 + r3 * 4] + movu [r2], m6 + movu [r2 + r3], m7 + movu [r2 + r3 * 2], m0 + movu [r2 + r5], m1 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_16x24 pp + FILTER_VER_CHROMA_AVX2_16x24 ps + +%macro FILTER_VER_CHROMA_AVX2_24x32 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_24x32, 4, 9, 10 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m8, [r5] + mova m9, [r5 + mmsize] + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m7, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m7, [pw_2000] +%endif + lea r6, [r3 * 3] + mov r5d, 2 +.loopH: + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r1 * 2], 1 + movu xm1, [r0 + r1] + 
vinserti128 m1, m1, [r0 + r4], 1 + + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + vperm2i128 m4, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + pmaddubsw m4, m8 + pmaddubsw m3, m2, m9 + paddw m4, m3 + pmaddubsw m2, m8 + + vextracti128 xm0, m0, 1 + lea r7, [r0 + r1 * 4] + vinserti128 m0, m0, [r7], 1 + + punpcklbw m5, m1, m0 + punpckhbw m3, m1, m0 + vperm2i128 m6, m5, m3, 0x20 + vperm2i128 m5, m5, m3, 0x31 + pmaddubsw m6, m8 + pmaddubsw m3, m5, m9 + paddw m6, m3 + pmaddubsw m5, m8 +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 0 + pmulhrsw m6, m7 ; m6 = word: row 1 + packuswb m4, m6 + vpermq m4, m4, 11011000b + vextracti128 xm6, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm6 +%else + psubw m4, m7 ; m4 = word: row 0 + psubw m6, m7 ; m6 = word: row 1 + movu [r2], m4 + movu [r2 + r3], m6 +%endif + + movu xm4, [r7 + r1 * 2] + vinserti128 m4, m4, [r7 + r1], 1 + vextracti128 xm1, m4, 1 + vinserti128 m0, m0, xm1, 0 + + punpcklbw m6, m0, m4 + punpckhbw m1, m0, m4 + vperm2i128 m0, m6, m1, 0x20 + vperm2i128 m6, m6, m1, 0x31 + pmaddubsw m1, m0, m9 + paddw m5, m1 + pmaddubsw m0, m8 + pmaddubsw m1, m6, m9 + paddw m2, m1 + pmaddubsw m6, m8 + +%ifidn %1,pp + pmulhrsw m2, m7 ; m2 = word: row 2 + pmulhrsw m5, m7 ; m5 = word: row 3 + packuswb m2, m5 + vpermq m2, m2, 11011000b + vextracti128 xm5, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm5 +%else + psubw m2, m7 ; m2 = word: row 2 + psubw m5, m7 ; m5 = word: row 3 + movu [r2 + r3 * 2], m2 + movu [r2 + r6], m5 +%endif + lea r8, [r2 + r3 * 4] + + movu xm1, [r7 + r4] + lea r7, [r7 + r1 * 4] + vinserti128 m1, m1, [r7], 1 + vinserti128 m4, m4, xm1, 1 + + punpcklbw m2, m4, m1 + punpckhbw m5, m4, m1 + vperm2i128 m3, m2, m5, 0x20 + vperm2i128 m2, m2, m5, 0x31 + pmaddubsw m5, m3, m9 + paddw m6, m5 + pmaddubsw m3, m8 + pmaddubsw m5, m2, m9 + paddw m0, m5 + pmaddubsw m2, m8 + +%ifidn %1,pp + pmulhrsw m6, m7 ; m6 = word: row 4 + pmulhrsw m0, m7 ; m0 = word: row 5 + packuswb m6, m0 + vpermq m6, m6, 11011000b + vextracti128 xm0, m6, 1 + movu 
[r8], xm6 + movu [r8 + r3], xm0 +%else + psubw m6, m7 ; m6 = word: row 4 + psubw m0, m7 ; m0 = word: row 5 + movu [r8], m6 + movu [r8 + r3], m0 +%endif + + movu xm6, [r7 + r1 * 2] + vinserti128 m6, m6, [r7 + r1], 1 + vextracti128 xm0, m6, 1 + vinserti128 m1, m1, xm0, 0 + + punpcklbw m4, m1, m6 + punpckhbw m5, m1, m6 + vperm2i128 m0, m4, m5, 0x20 + vperm2i128 m5, m4, m5, 0x31 + pmaddubsw m4, m0, m9 + paddw m2, m4 + pmaddubsw m0, m8 + pmaddubsw m4, m5, m9 + paddw m3, m4 + pmaddubsw m5, m8 + +%ifidn %1,pp + pmulhrsw m3, m7 ; m3 = word: row 6 + pmulhrsw m2, m7 ; m2 = word: row 7 + packuswb m3, m2 + vpermq m3, m3, 11011000b + vextracti128 xm2, m3, 1 + movu [r8 + r3 * 2], xm3 + movu [r8 + r6], xm2 +%else + psubw m3, m7 ; m3 = word: row 6 + psubw m2, m7 ; m2 = word: row 7 + movu [r8 + r3 * 2], m3 + movu [r8 + r6], m2 +%endif + lea r8, [r8 + r3 * 4] + + movu xm3, [r7 + r4] + lea r7, [r7 + r1 * 4] + vinserti128 m3, m3, [r7], 1 + vinserti128 m6, m6, xm3, 1 + + punpcklbw m2, m6, m3 + punpckhbw m1, m6, m3 + vperm2i128 m4, m2, m1, 0x20 + vperm2i128 m2, m2, m1, 0x31 + pmaddubsw m1, m4, m9 + paddw m5, m1 + pmaddubsw m4, m8 + pmaddubsw m1, m2, m9 + paddw m0, m1 + pmaddubsw m2, m8 + +%ifidn %1,pp + pmulhrsw m5, m7 ; m5 = word: row 8 + pmulhrsw m0, m7 ; m0 = word: row 9 + packuswb m5, m0 + vpermq m5, m5, 11011000b + vextracti128 xm0, m5, 1 + movu [r8], xm5 + movu [r8 + r3], xm0 +%else + psubw m5, m7 ; m5 = word: row 8 + psubw m0, m7 ; m0 = word: row 9 + movu [r8], m5 + movu [r8 + r3], m0 +%endif + + movu xm5, [r7 + r1 * 2] + vinserti128 m5, m5, [r7 + r1], 1 + vextracti128 xm0, m5, 1 + vinserti128 m3, m3, xm0, 0 + + punpcklbw m1, m3, m5 + punpckhbw m0, m3, m5 + vperm2i128 m6, m1, m0, 0x20 + vperm2i128 m0, m1, m0, 0x31 + pmaddubsw m1, m6, m9 + paddw m2, m1 + pmaddubsw m6, m8 + pmaddubsw m1, m0, m9 + paddw m4, m1 + pmaddubsw m0, m8 + +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 10 + pmulhrsw m2, m7 ; m2 = word: row 11 + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm2, 
m4, 1 + movu [r8 + r3 * 2], xm4 + movu [r8 + r6], xm2 +%else + psubw m4, m7 ; m4 = word: row 10 + psubw m2, m7 ; m2 = word: row 11 + movu [r8 + r3 * 2], m4 + movu [r8 + r6], m2 +%endif + lea r8, [r8 + r3 * 4] + + movu xm3, [r7 + r4] + lea r7, [r7 + r1 * 4] + vinserti128 m3, m3, [r7], 1 + vinserti128 m5, m5, xm3, 1 + + punpcklbw m2, m5, m3 + punpckhbw m1, m5, m3 + vperm2i128 m4, m2, m1, 0x20 + vperm2i128 m2, m2, m1, 0x31 + pmaddubsw m1, m4, m9 + paddw m0, m1 + pmaddubsw m4, m8 + pmaddubsw m1, m2, m9 + paddw m6, m1 + pmaddubsw m2, m8 + +%ifidn %1,pp + pmulhrsw m0, m7 ; m0 = word: row 12 + pmulhrsw m6, m7 ; m6 = word: row 13 + packuswb m0, m6 + vpermq m0, m0, 11011000b + vextracti128 xm6, m0, 1 + movu [r8], xm0 + movu [r8 + r3], xm6 +%else + psubw m0, m7 ; m0 = word: row 12 + psubw m6, m7 ; m6 = word: row 13 + movu [r8], m0 + movu [r8 + r3], m6 +%endif + + movu xm5, [r7 + r1 * 2] + vinserti128 m5, m5, [r7 + r1], 1 + vextracti128 xm0, m5, 1 + vinserti128 m3, m3, xm0, 0 + + punpcklbw m1, m3, m5 + punpckhbw m0, m3, m5 + vperm2i128 m6, m1, m0, 0x20 + vperm2i128 m0, m1, m0, 0x31 + pmaddubsw m6, m9 + paddw m2, m6 + pmaddubsw m0, m9 + paddw m4, m0 + +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 14 + pmulhrsw m2, m7 ; m2 = word: row 15 + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm2, m4, 1 + movu [r8 + r3 * 2], xm4 + movu [r8 + r6], xm2 + add r2, 16 +%else + psubw m4, m7 ; m4 = word: row 14 + psubw m2, m7 ; m2 = word: row 15 + movu [r8 + r3 * 2], m4 + movu [r8 + r6], m2 + add r2, 32 +%endif + add r0, 16 + movq xm1, [r0] ; m1 = row 0 + movq xm2, [r0 + r1] ; m2 = row 1 + punpcklbw xm1, xm2 + movq xm3, [r0 + r1 * 2] ; m3 = row 2 + punpcklbw xm2, xm3 + vinserti128 m5, m1, xm2, 1 + pmaddubsw m5, m8 + movq xm4, [r0 + r4] ; m4 = row 3 + punpcklbw xm3, xm4 + lea r7, [r0 + r1 * 4] + movq xm1, [r7] ; m1 = row 4 + punpcklbw xm4, xm1 + vinserti128 m2, m3, xm4, 1 + pmaddubsw m0, m2, m9 + paddw m5, m0 + pmaddubsw m2, m8 + movq xm3, [r7 + r1] ; m3 = row 5 + punpcklbw xm1, 
xm3 + movq xm4, [r7 + r1 * 2] ; m4 = row 6 + punpcklbw xm3, xm4 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m0, m1, m9 + paddw m2, m0 + pmaddubsw m1, m8 + movq xm3, [r7 + r4] ; m3 = row 7 + punpcklbw xm4, xm3 + lea r7, [r7 + r1 * 4] + movq xm0, [r7] ; m0 = row 8 + punpcklbw xm3, xm0 + vinserti128 m4, m4, xm3, 1 + pmaddubsw m3, m4, m9 + paddw m1, m3 + pmaddubsw m4, m8 + movq xm3, [r7 + r1] ; m3 = row 9 + punpcklbw xm0, xm3 + movq xm6, [r7 + r1 * 2] ; m6 = row 10 + punpcklbw xm3, xm6 + vinserti128 m0, m0, xm3, 1 + pmaddubsw m3, m0, m9 + paddw m4, m3 + pmaddubsw m0, m8 + +%ifidn %1,pp + pmulhrsw m5, m7 ; m5 = word: row 0, row 1 + pmulhrsw m2, m7 ; m2 = word: row 2, row 3 + pmulhrsw m1, m7 ; m1 = word: row 4, row 5 + pmulhrsw m4, m7 ; m4 = word: row 6, row 7 + packuswb m5, m2 + packuswb m1, m4 + vextracti128 xm2, m5, 1 + vextracti128 xm4, m1, 1 + movq [r2], xm5 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm2 + lea r8, [r2 + r3 * 4] + movq [r8], xm1 + movq [r8 + r3], xm4 + movhps [r8 + r3 * 2], xm1 + movhps [r8 + r6], xm4 +%else + psubw m5, m7 ; m5 = word: row 0, row 1 + psubw m2, m7 ; m2 = word: row 2, row 3 + psubw m1, m7 ; m1 = word: row 4, row 5 + psubw m4, m7 ; m4 = word: row 6, row 7 + vextracti128 xm3, m5, 1 + movu [r2], xm5 + movu [r2 + r3], xm3 + vextracti128 xm3, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 + vextracti128 xm3, m1, 1 + lea r8, [r2 + r3 * 4] + movu [r8], xm1 + movu [r8 + r3], xm3 + vextracti128 xm3, m4, 1 + movu [r8 + r3 * 2], xm4 + movu [r8 + r6], xm3 +%endif + lea r8, [r8 + r3 * 4] + + movq xm3, [r7 + r4] ; m3 = row 11 + punpcklbw xm6, xm3 + lea r7, [r7 + r1 * 4] + movq xm5, [r7] ; m5 = row 12 + punpcklbw xm3, xm5 + vinserti128 m6, m6, xm3, 1 + pmaddubsw m3, m6, m9 + paddw m0, m3 + pmaddubsw m6, m8 + movq xm3, [r7 + r1] ; m3 = row 13 + punpcklbw xm5, xm3 + movq xm2, [r7 + r1 * 2] ; m2 = row 14 + punpcklbw xm3, xm2 + vinserti128 m5, m5, xm3, 1 + pmaddubsw m3, m5, m9 + paddw m6, m3 + pmaddubsw m5, m8 + movq xm3, 
[r7 + r4] ; m3 = row 15 + punpcklbw xm2, xm3 + lea r7, [r7 + r1 * 4] + movq xm1, [r7] ; m1 = row 16 + punpcklbw xm3, xm1 + vinserti128 m2, m2, xm3, 1 + pmaddubsw m3, m2, m9 + paddw m5, m3 + pmaddubsw m2, m8 + movq xm3, [r7 + r1] ; m3 = row 17 + punpcklbw xm1, xm3 + movq xm4, [r7 + r1 * 2] ; m4 = row 18 + punpcklbw xm3, xm4 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m3, m1, m9 + paddw m2, m3 +%ifidn %1,pp + pmulhrsw m0, m7 ; m0 = word: row 8, row 9 + pmulhrsw m6, m7 ; m6 = word: row 10, row 11 + pmulhrsw m5, m7 ; m5 = word: row 12, row 13 + pmulhrsw m2, m7 ; m2 = word: row 14, row 15 + packuswb m0, m6 + packuswb m5, m2 + vextracti128 xm6, m0, 1 + vextracti128 xm2, m5, 1 + movq [r8], xm0 + movq [r8 + r3], xm6 + movhps [r8 + r3 * 2], xm0 + movhps [r8 + r6], xm6 + lea r8, [r8 + r3 * 4] + movq [r8], xm5 + movq [r8 + r3], xm2 + movhps [r8 + r3 * 2], xm5 + movhps [r8 + r6], xm2 + lea r2, [r8 + r3 * 4 - 16] +%else + psubw m0, m7 ; m0 = word: row 8, row 9 + psubw m6, m7 ; m6 = word: row 10, row 11 + psubw m5, m7 ; m5 = word: row 12, row 13 + psubw m2, m7 ; m2 = word: row 14, row 15 + vextracti128 xm3, m0, 1 + movu [r8], xm0 + movu [r8 + r3], xm3 + vextracti128 xm3, m6, 1 + movu [r8 + r3 * 2], xm6 + movu [r8 + r6], xm3 + vextracti128 xm3, m5, 1 + lea r8, [r8 + r3 * 4] + movu [r8], xm5 + movu [r8 + r3], xm3 + vextracti128 xm3, m2, 1 + movu [r8 + r3 * 2], xm2 + movu [r8 + r6], xm3 + lea r2, [r8 + r3 * 4 - 32] +%endif + lea r0, [r7 - 16] + dec r5d + jnz .loopH + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_24x32 pp + FILTER_VER_CHROMA_AVX2_24x32 ps + +%macro FILTER_VER_CHROMA_AVX2_24x64 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_24x64, 4, 7, 13 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m10, [r5] + mova m11, [r5 + mmsize] + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m12, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m12, [pw_2000] 
+%endif + lea r5, [r3 * 3] + mov r6d, 16 +.loopH: + movu m0, [r0] ; m0 = row 0 + movu m1, [r0 + r1] ; m1 = row 1 + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + movu m0, [r0 + r1 * 2] ; m0 = row 2 + punpcklbw m4, m1, m0 + punpckhbw m5, m1, m0 + pmaddubsw m4, m10 + pmaddubsw m5, m10 + movu m1, [r0 + r4] ; m1 = row 3 + punpcklbw m6, m0, m1 + punpckhbw m7, m0, m1 + pmaddubsw m8, m6, m11 + pmaddubsw m9, m7, m11 + pmaddubsw m6, m10 + pmaddubsw m7, m10 + paddw m2, m8 + paddw m3, m9 +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2], xm2 + vextracti128 xm2, m2, 1 + movq [r2 + 16], xm2 +%else + psubw m2, m12 + psubw m3, m12 + vperm2i128 m0, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + movu [r2], m0 + movu [r2 + mmsize], xm2 +%endif + lea r0, [r0 + r1 * 4] + movu m0, [r0] ; m0 = row 4 + punpcklbw m2, m1, m0 + punpckhbw m3, m1, m0 + pmaddubsw m8, m2, m11 + pmaddubsw m9, m3, m11 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + paddw m4, m8 + paddw m5, m9 +%ifidn %1,pp + pmulhrsw m4, m12 + pmulhrsw m5, m12 + packuswb m4, m5 + movu [r2 + r3], xm4 + vextracti128 xm4, m4, 1 + movq [r2 + r3 + 16], xm4 +%else + psubw m4, m12 + psubw m5, m12 + vperm2i128 m1, m4, m5, 0x20 + vperm2i128 m4, m4, m5, 0x31 + movu [r2 + r3], m1 + movu [r2 + r3 + mmsize], xm4 +%endif + + movu m1, [r0 + r1] ; m1 = row 5 + punpcklbw m4, m0, m1 + punpckhbw m5, m0, m1 + pmaddubsw m4, m11 + pmaddubsw m5, m11 + paddw m6, m4 + paddw m7, m5 +%ifidn %1,pp + pmulhrsw m6, m12 + pmulhrsw m7, m12 + packuswb m6, m7 + movu [r2 + r3 * 2], xm6 + vextracti128 xm6, m6, 1 + movq [r2 + r3 * 2 + 16], xm6 +%else + psubw m6, m12 + psubw m7, m12 + vperm2i128 m0, m6, m7, 0x20 + vperm2i128 m6, m6, m7, 0x31 + movu [r2 + r3 * 2], m0 + movu [r2 + r3 * 2 + mmsize], xm6 +%endif + + movu m0, [r0 + r1 * 2] ; m0 = row 6 + punpcklbw m6, m1, m0 + punpckhbw m7, m1, m0 + pmaddubsw m6, m11 + pmaddubsw m7, m11 + paddw m2, m6 + paddw m3, m7 +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw 
m3, m12 + packuswb m2, m3 + movu [r2 + r5], xm2 + vextracti128 xm2, m2, 1 + movq [r2 + r5 + 16], xm2 +%else + psubw m2, m12 + psubw m3, m12 + vperm2i128 m0, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + movu [r2 + r5], m0 + movu [r2 + r5 + mmsize], xm2 +%endif + lea r2, [r2 + r3 * 4] + dec r6d + jnz .loopH + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_24x64 pp + FILTER_VER_CHROMA_AVX2_24x64 ps + +%macro FILTER_VER_CHROMA_AVX2_16x4 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_16x4, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m7, [pw_512] +%else + add r3d, r3d + mova m7, [pw_2000] +%endif + + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r1 * 2], 1 + movu xm1, [r0 + r1] + vinserti128 m1, m1, [r0 + r4], 1 + + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + vperm2i128 m4, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + pmaddubsw m4, [r5] + pmaddubsw m3, m2, [r5 + mmsize] + paddw m4, m3 + pmaddubsw m2, [r5] + + vextracti128 xm0, m0, 1 + lea r0, [r0 + r1 * 4] + vinserti128 m0, m0, [r0], 1 + + punpcklbw m5, m1, m0 + punpckhbw m3, m1, m0 + vperm2i128 m6, m5, m3, 0x20 + vperm2i128 m5, m5, m3, 0x31 + pmaddubsw m6, [r5] + pmaddubsw m3, m5, [r5 + mmsize] + paddw m6, m3 + pmaddubsw m5, [r5] +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 0 + pmulhrsw m6, m7 ; m6 = word: row 1 + packuswb m4, m6 + vpermq m4, m4, 11011000b + vextracti128 xm6, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm6 +%else + psubw m4, m7 ; m4 = word: row 0 + psubw m6, m7 ; m6 = word: row 1 + movu [r2], m4 + movu [r2 + r3], m6 +%endif + lea r2, [r2 + r3 * 2] + + movu xm4, [r0 + r1 * 2] + vinserti128 m4, m4, [r0 + r1], 1 + vextracti128 xm1, m4, 1 + vinserti128 m0, m0, xm1, 0 + + punpcklbw m6, m0, m4 + punpckhbw m1, m0, m4 + vperm2i128 m0, m6, m1, 0x20 + vperm2i128 m6, m6, m1, 0x31 + pmaddubsw m0, [r5 + mmsize] + paddw m5, m0 + pmaddubsw m6, [r5 + mmsize] 
+ paddw m2, m6 + +%ifidn %1,pp + pmulhrsw m2, m7 ; m2 = word: row 2 + pmulhrsw m5, m7 ; m5 = word: row 3 + packuswb m2, m5 + vpermq m2, m2, 11011000b + vextracti128 xm5, m2, 1 + movu [r2], xm2 + movu [r2 + r3], xm5 +%else + psubw m2, m7 ; m2 = word: row 2 + psubw m5, m7 ; m5 = word: row 3 + movu [r2], m2 + movu [r2 + r3], m5 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_AVX2_16x4 pp + FILTER_VER_CHROMA_AVX2_16x4 ps + +%macro FILTER_VER_CHROMA_AVX2_12xN 2 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_12x%2, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m7, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m7, [pw_2000] +%endif + lea r6, [r3 * 3] +%rep %2 / 16 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m4, m2, [r5 + 1 * mmsize] + paddw m0, m4 + pmaddubsw m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m5, m3, [r5 + 1 * mmsize] + paddw m1, m5 + pmaddubsw m3, [r5] +%ifidn %1,pp + pmulhrsw m0, m7 ; m0 = word: row 0 + pmulhrsw m1, m7 ; m1 = word: row 1 + packuswb m0, m1 + vextracti128 xm1, m0, 1 + movq [r2], xm0 + movd [r2 + 8], xm1 + movhps [r2 + r3], xm0 + pextrd [r2 + r3 + 8], xm1, 2 +%else + psubw m0, m7 ; m0 = word: row 0 + psubw m1, m7 ; m1 = word: row 1 + movu [r2], xm0 + vextracti128 xm0, m0, 1 + movq [r2 + 16], xm0 + movu [r2 + r3], xm1 + vextracti128 xm1, m1, 1 + movq [r2 + r3 + 16], xm1 +%endif + + movu xm5, [r0 + r1] ; m5 = row 5 + 
punpckhbw xm6, xm4, xm5 + punpcklbw xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddubsw m6, m4, [r5 + 1 * mmsize] + paddw m2, m6 + pmaddubsw m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhbw xm0, xm5, xm6 + punpcklbw xm5, xm6 + vinserti128 m5, m5, xm0, 1 + pmaddubsw m0, m5, [r5 + 1 * mmsize] + paddw m3, m0 + pmaddubsw m5, [r5] +%ifidn %1,pp + pmulhrsw m2, m7 ; m2 = word: row 2 + pmulhrsw m3, m7 ; m3 = word: row 3 + packuswb m2, m3 + vextracti128 xm3, m2, 1 + movq [r2 + r3 * 2], xm2 + movd [r2 + r3 * 2 + 8], xm3 + movhps [r2 + r6], xm2 + pextrd [r2 + r6 + 8], xm3, 2 +%else + psubw m2, m7 ; m2 = word: row 2 + psubw m3, m7 ; m3 = word: row 3 + movu [r2 + r3 * 2], xm2 + vextracti128 xm2, m2, 1 + movq [r2 + r3 * 2 + 16], xm2 + movu [r2 + r6], xm3 + vextracti128 xm3, m3, 1 + movq [r2 + r6 + 16], xm3 +%endif + lea r2, [r2 + r3 * 4] + + movu xm0, [r0 + r4] ; m0 = row 7 + punpckhbw xm3, xm6, xm0 + punpcklbw xm6, xm0 + vinserti128 m6, m6, xm3, 1 + pmaddubsw m3, m6, [r5 + 1 * mmsize] + paddw m4, m3 + pmaddubsw m6, [r5] + lea r0, [r0 + r1 * 4] + movu xm3, [r0] ; m3 = row 8 + punpckhbw xm1, xm0, xm3 + punpcklbw xm0, xm3 + vinserti128 m0, m0, xm1, 1 + pmaddubsw m1, m0, [r5 + 1 * mmsize] + paddw m5, m1 + pmaddubsw m0, [r5] +%ifidn %1,pp + pmulhrsw m4, m7 ; m4 = word: row 4 + pmulhrsw m5, m7 ; m5 = word: row 5 + packuswb m4, m5 + vextracti128 xm5, m4, 1 + movq [r2], xm4 + movd [r2 + 8], xm5 + movhps [r2 + r3], xm4 + pextrd [r2 + r3 + 8], xm5, 2 +%else + psubw m4, m7 ; m4 = word: row 4 + psubw m5, m7 ; m5 = word: row 5 + movu [r2], xm4 + vextracti128 xm4, m4, 1 + movq [r2 + 16], xm4 + movu [r2 + r3], xm5 + vextracti128 xm5, m5, 1 + movq [r2 + r3 + 16], xm5 +%endif + + movu xm1, [r0 + r1] ; m1 = row 9 + punpckhbw xm2, xm3, xm1 + punpcklbw xm3, xm1 + vinserti128 m3, m3, xm2, 1 + pmaddubsw m2, m3, [r5 + 1 * mmsize] + paddw m6, m2 + pmaddubsw m3, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 10 + punpckhbw xm4, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm4, 1 + pmaddubsw 
m4, m1, [r5 + 1 * mmsize] + paddw m0, m4 + pmaddubsw m1, [r5] + +%ifidn %1,pp + pmulhrsw m6, m7 ; m6 = word: row 6 + pmulhrsw m0, m7 ; m0 = word: row 7 + packuswb m6, m0 + vextracti128 xm0, m6, 1 + movq [r2 + r3 * 2], xm6 + movd [r2 + r3 * 2 + 8], xm0 + movhps [r2 + r6], xm6 + pextrd [r2 + r6 + 8], xm0, 2 +%else + psubw m6, m7 ; m6 = word: row 6 + psubw m0, m7 ; m0 = word: row 7 + movu [r2 + r3 * 2], xm6 + vextracti128 xm6, m6, 1 + movq [r2 + r3 * 2 + 16], xm6 + movu [r2 + r6], xm0 + vextracti128 xm0, m0, 1 + movq [r2 + r6 + 16], xm0 +%endif + lea r2, [r2 + r3 * 4] + + movu xm4, [r0 + r4] ; m4 = row 11 + punpckhbw xm6, xm2, xm4 + punpcklbw xm2, xm4 + vinserti128 m2, m2, xm6, 1 + pmaddubsw m6, m2, [r5 + 1 * mmsize] + paddw m3, m6 + pmaddubsw m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm6, [r0] ; m6 = row 12 + punpckhbw xm0, xm4, xm6 + punpcklbw xm4, xm6 + vinserti128 m4, m4, xm0, 1 + pmaddubsw m0, m4, [r5 + 1 * mmsize] + paddw m1, m0 + pmaddubsw m4, [r5] +%ifidn %1,pp + pmulhrsw m3, m7 ; m3 = word: row 8 + pmulhrsw m1, m7 ; m1 = word: row 9 + packuswb m3, m1 + vextracti128 xm1, m3, 1 + movq [r2], xm3 + movd [r2 + 8], xm1 + movhps [r2 + r3], xm3 + pextrd [r2 + r3 + 8], xm1, 2 +%else + psubw m3, m7 ; m3 = word: row 8 + psubw m1, m7 ; m1 = word: row 9 + movu [r2], xm3 + vextracti128 xm3, m3, 1 + movq [r2 + 16], xm3 + movu [r2 + r3], xm1 + vextracti128 xm1, m1, 1 + movq [r2 + r3 + 16], xm1 +%endif + + movu xm0, [r0 + r1] ; m0 = row 13 + punpckhbw xm1, xm6, xm0 + punpcklbw xm6, xm0 + vinserti128 m6, m6, xm1, 1 + pmaddubsw m1, m6, [r5 + 1 * mmsize] + paddw m2, m1 + pmaddubsw m6, [r5] + movu xm1, [r0 + r1 * 2] ; m1 = row 14 + punpckhbw xm5, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm5, 1 + pmaddubsw m5, m0, [r5 + 1 * mmsize] + paddw m4, m5 + pmaddubsw m0, [r5] +%ifidn %1,pp + pmulhrsw m2, m7 ; m2 = word: row 10 + pmulhrsw m4, m7 ; m4 = word: row 11 + packuswb m2, m4 + vextracti128 xm4, m2, 1 + movq [r2 + r3 * 2], xm2 + movd [r2 + r3 * 2 + 8], xm4 + movhps [r2 + 
r6], xm2 + pextrd [r2 + r6 + 8], xm4, 2 +%else + psubw m2, m7 ; m2 = word: row 10 + psubw m4, m7 ; m4 = word: row 11 + movu [r2 + r3 * 2], xm2 + vextracti128 xm2, m2, 1 + movq [r2 + r3 * 2 + 16], xm2 + movu [r2 + r6], xm4 + vextracti128 xm4, m4, 1 + movq [r2 + r6 + 16], xm4 +%endif + lea r2, [r2 + r3 * 4] + + movu xm5, [r0 + r4] ; m5 = row 15 + punpckhbw xm2, xm1, xm5 + punpcklbw xm1, xm5 + vinserti128 m1, m1, xm2, 1 + pmaddubsw m2, m1, [r5 + 1 * mmsize] + paddw m6, m2 + pmaddubsw m1, [r5] + lea r0, [r0 + r1 * 4] + movu xm2, [r0] ; m2 = row 16 + punpckhbw xm3, xm5, xm2 + punpcklbw xm5, xm2 + vinserti128 m5, m5, xm3, 1 + pmaddubsw m3, m5, [r5 + 1 * mmsize] + paddw m0, m3 + pmaddubsw m5, [r5] + movu xm3, [r0 + r1] ; m3 = row 17 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m2, [r5 + 1 * mmsize] + paddw m1, m2 + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhbw xm2, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm2, 1 + pmaddubsw m3, [r5 + 1 * mmsize] + paddw m5, m3 + +%ifidn %1,pp + pmulhrsw m6, m7 ; m6 = word: row 12 + pmulhrsw m0, m7 ; m0 = word: row 13 + pmulhrsw m1, m7 ; m1 = word: row 14 + pmulhrsw m5, m7 ; m5 = word: row 15 + packuswb m6, m0 + packuswb m1, m5 + vextracti128 xm0, m6, 1 + vextracti128 xm5, m1, 1 + movq [r2], xm6 + movd [r2 + 8], xm0 + movhps [r2 + r3], xm6 + pextrd [r2 + r3 + 8], xm0, 2 + movq [r2 + r3 * 2], xm1 + movd [r2 + r3 * 2 + 8], xm5 + movhps [r2 + r6], xm1 + pextrd [r2 + r6 + 8], xm5, 2 +%else + psubw m6, m7 ; m6 = word: row 12 + psubw m0, m7 ; m0 = word: row 13 + psubw m1, m7 ; m1 = word: row 14 + psubw m5, m7 ; m5 = word: row 15 + movu [r2], xm6 + vextracti128 xm6, m6, 1 + movq [r2 + 16], xm6 + movu [r2 + r3], xm0 + vextracti128 xm0, m0, 1 + movq [r2 + r3 + 16], xm0 + movu [r2 + r3 * 2], xm1 + vextracti128 xm1, m1, 1 + movq [r2 + r3 * 2 + 16], xm1 + movu [r2 + r6], xm5 + vextracti128 xm5, m5, 1 + movq [r2 + r6 + 16], xm5 +%endif + lea r2, [r2 + r3 * 4] +%endrep + RET +%endmacro + + 
FILTER_VER_CHROMA_AVX2_12xN pp, 16 + FILTER_VER_CHROMA_AVX2_12xN ps, 16 + FILTER_VER_CHROMA_AVX2_12xN pp, 32 + FILTER_VER_CHROMA_AVX2_12xN ps, 32 + +;----------------------------------------------------------------------------- +;void interp_4tap_vert_pp_24x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W24 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_24x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + + mov r4d, %2 + +.loop: + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r5, [r0 + 2 * r1] + movu m5, [r5] + movu m7, [r5 + r1] + + punpcklbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m4, m6 + + punpckhbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m2, m6 + + mova m6, [pw_512] + + pmulhrsw m4, m6 + pmulhrsw m2, m6 + + packuswb m4, m2 + + movu [r2], m4 + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m4, m1 + pmaddubsw m3, m1 + + movu m2, [r5 + 2 * r1] + + punpcklbw m5, m7, m2 + punpckhbw m7, m2 + + pmaddubsw m5, m0 + pmaddubsw m7, m0 + + paddw m4, m5 + paddw m3, m7 + + pmulhrsw m4, m6 + pmulhrsw m3, m6 + + packuswb m4, m3 + + movu [r2 + r3], m4 + + movq m2, [r0 + 16] + movq m3, [r0 + r1 + 16] + movq m4, [r5 + 16] + movq m5, [r5 + r1 + 16] + + punpcklbw m2, m3 + punpcklbw m4, m5 + + pmaddubsw m2, m1 + pmaddubsw m4, m0 + + paddw m2, m4 + + pmulhrsw m2, m6 + + movq m3, [r0 + r1 + 16] + movq m4, [r5 + 16] + movq m5, [r5 + r1 + 16] + movq m7, [r5 + 2 * r1 + 16] + + punpcklbw m3, m4 + punpcklbw m5, m7 + + pmaddubsw m3, m1 + pmaddubsw m5, m0 + + paddw m3, m5 + + pmulhrsw m3, m6 + packuswb m2, m3 + + movh [r2 + 16], m2 + movhps [r2 + r3 + 16], m2 + + mov r0, r5 + lea r2, [r2 + 2 * r3] + + sub r4, 2 + 
jnz .loop + RET +%endmacro + + FILTER_V4_W24 24, 32 + + FILTER_V4_W24 24, 64 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W32 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + + mova m7, [pw_512] + + mov r4d, %2 + +.loop: + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r5, [r0 + 2 * r1] + movu m3, [r5] + movu m5, [r5 + r1] + + punpcklbw m6, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m6, m0 + pmaddubsw m3, m0 + + paddw m4, m6 + paddw m2, m3 + + pmulhrsw m4, m7 + pmulhrsw m2, m7 + + packuswb m4, m2 + + movu [r2], m4 + + movu m2, [r0 + 16] + movu m3, [r0 + r1 + 16] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + movu m3, [r5 + 16] + movu m5, [r5 + r1 + 16] + + punpcklbw m6, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m6, m0 + pmaddubsw m3, m0 + + paddw m4, m6 + paddw m2, m3 + + pmulhrsw m4, m7 + pmulhrsw m2, m7 + + packuswb m4, m2 + + movu [r2 + 16], m4 + + lea r0, [r0 + r1] + lea r2, [r2 + r3] + + dec r4 + jnz .loop + RET +%endmacro + + FILTER_V4_W32 32, 8 + FILTER_V4_W32 32, 16 + FILTER_V4_W32 32, 24 + FILTER_V4_W32 32, 32 + + FILTER_V4_W32 32, 48 + FILTER_V4_W32 32, 64 + +%macro FILTER_VER_CHROMA_AVX2_32xN 2 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_32x%2, 4, 7, 13 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m10, [r5] + mova m11, [r5 + mmsize] + lea r4, [r1 * 3] + sub 
r0, r1 +%ifidn %1,pp + mova m12, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m12, [pw_2000] +%endif + lea r5, [r3 * 3] + mov r6d, %2 / 4 +.loopW: + movu m0, [r0] ; m0 = row 0 + movu m1, [r0 + r1] ; m1 = row 1 + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + movu m0, [r0 + r1 * 2] ; m0 = row 2 + punpcklbw m4, m1, m0 + punpckhbw m5, m1, m0 + pmaddubsw m4, m10 + pmaddubsw m5, m10 + movu m1, [r0 + r4] ; m1 = row 3 + punpcklbw m6, m0, m1 + punpckhbw m7, m0, m1 + pmaddubsw m8, m6, m11 + pmaddubsw m9, m7, m11 + pmaddubsw m6, m10 + pmaddubsw m7, m10 + paddw m2, m8 + paddw m3, m9 +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2], m2 +%else + psubw m2, m12 + psubw m3, m12 + vperm2i128 m0, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + movu [r2], m0 + movu [r2 + mmsize], m2 +%endif + lea r0, [r0 + r1 * 4] + movu m0, [r0] ; m0 = row 4 + punpcklbw m2, m1, m0 + punpckhbw m3, m1, m0 + pmaddubsw m8, m2, m11 + pmaddubsw m9, m3, m11 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + paddw m4, m8 + paddw m5, m9 +%ifidn %1,pp + pmulhrsw m4, m12 + pmulhrsw m5, m12 + packuswb m4, m5 + movu [r2 + r3], m4 +%else + psubw m4, m12 + psubw m5, m12 + vperm2i128 m1, m4, m5, 0x20 + vperm2i128 m4, m4, m5, 0x31 + movu [r2 + r3], m1 + movu [r2 + r3 + mmsize], m4 +%endif + + movu m1, [r0 + r1] ; m1 = row 5 + punpcklbw m4, m0, m1 + punpckhbw m5, m0, m1 + pmaddubsw m4, m11 + pmaddubsw m5, m11 + paddw m6, m4 + paddw m7, m5 +%ifidn %1,pp + pmulhrsw m6, m12 + pmulhrsw m7, m12 + packuswb m6, m7 + movu [r2 + r3 * 2], m6 +%else + psubw m6, m12 + psubw m7, m12 + vperm2i128 m0, m6, m7, 0x20 + vperm2i128 m6, m6, m7, 0x31 + movu [r2 + r3 * 2], m0 + movu [r2 + r3 * 2 + mmsize], m6 +%endif + + movu m0, [r0 + r1 * 2] ; m0 = row 6 + punpcklbw m6, m1, m0 + punpckhbw m7, m1, m0 + pmaddubsw m6, m11 + pmaddubsw m7, m11 + paddw m2, m6 + paddw m3, m7 +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2 + r5], m2 +%else + psubw m2, 
m12 + psubw m3, m12 + vperm2i128 m0, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + movu [r2 + r5], m0 + movu [r2 + r5 + mmsize], m2 +%endif + lea r2, [r2 + r3 * 4] + dec r6d + jnz .loopW + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_32xN pp, 64 + FILTER_VER_CHROMA_AVX2_32xN pp, 48 + FILTER_VER_CHROMA_AVX2_32xN pp, 32 + FILTER_VER_CHROMA_AVX2_32xN pp, 24 + FILTER_VER_CHROMA_AVX2_32xN pp, 16 + FILTER_VER_CHROMA_AVX2_32xN pp, 8 + FILTER_VER_CHROMA_AVX2_32xN ps, 64 + FILTER_VER_CHROMA_AVX2_32xN ps, 48 + FILTER_VER_CHROMA_AVX2_32xN ps, 32 + FILTER_VER_CHROMA_AVX2_32xN ps, 24 + FILTER_VER_CHROMA_AVX2_32xN ps, 16 + FILTER_VER_CHROMA_AVX2_32xN ps, 8 + +%macro FILTER_VER_CHROMA_AVX2_48x64 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_48x64, 4, 8, 13 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m10, [r5] + mova m11, [r5 + mmsize] + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m12, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m12, [pw_2000] +%endif + lea r5, [r3 * 3] + lea r7, [r1 * 4] + mov r6d, 16 +.loopH: + movu m0, [r0] ; m0 = row 0 + movu m1, [r0 + r1] ; m1 = row 1 + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + movu m0, [r0 + r1 * 2] ; m0 = row 2 + punpcklbw m4, m1, m0 + punpckhbw m5, m1, m0 + pmaddubsw m4, m10 + pmaddubsw m5, m10 + movu m1, [r0 + r4] ; m1 = row 3 + punpcklbw m6, m0, m1 + punpckhbw m7, m0, m1 + pmaddubsw m8, m6, m11 + pmaddubsw m9, m7, m11 + pmaddubsw m6, m10 + pmaddubsw m7, m10 + paddw m2, m8 + paddw m3, m9 +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2], m2 +%else + psubw m2, m12 + psubw m3, m12 + vperm2i128 m0, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + movu [r2], m0 + movu [r2 + mmsize], m2 +%endif + lea r0, [r0 + r1 * 4] + movu m0, [r0] ; m0 = row 4 + punpcklbw m2, m1, m0 + punpckhbw m3, m1, m0 + pmaddubsw m8, m2, m11 + pmaddubsw m9, 
m3, m11 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + paddw m4, m8 + paddw m5, m9 +%ifidn %1,pp + pmulhrsw m4, m12 + pmulhrsw m5, m12 + packuswb m4, m5 + movu [r2 + r3], m4 +%else + psubw m4, m12 + psubw m5, m12 + vperm2i128 m1, m4, m5, 0x20 + vperm2i128 m4, m4, m5, 0x31 + movu [r2 + r3], m1 + movu [r2 + r3 + mmsize], m4 +%endif + + movu m1, [r0 + r1] ; m1 = row 5 + punpcklbw m4, m0, m1 + punpckhbw m5, m0, m1 + pmaddubsw m4, m11 + pmaddubsw m5, m11 + paddw m6, m4 + paddw m7, m5 +%ifidn %1,pp + pmulhrsw m6, m12 + pmulhrsw m7, m12 + packuswb m6, m7 + movu [r2 + r3 * 2], m6 +%else + psubw m6, m12 + psubw m7, m12 + vperm2i128 m0, m6, m7, 0x20 + vperm2i128 m6, m6, m7, 0x31 + movu [r2 + r3 * 2], m0 + movu [r2 + r3 * 2 + mmsize], m6 +%endif + + movu m0, [r0 + r1 * 2] ; m0 = row 6 + punpcklbw m6, m1, m0 + punpckhbw m7, m1, m0 + pmaddubsw m6, m11 + pmaddubsw m7, m11 + paddw m2, m6 + paddw m3, m7 +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2 + r5], m2 + add r2, 32 +%else + psubw m2, m12 + psubw m3, m12 + vperm2i128 m0, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + movu [r2 + r5], m0 + movu [r2 + r5 + mmsize], m2 + add r2, 64 +%endif + sub r0, r7 + + movu xm0, [r0 + 32] ; m0 = row 0 + movu xm1, [r0 + r1 + 32] ; m1 = row 1 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m0, m10 + movu xm2, [r0 + r1 * 2 + 32] ; m2 = row 2 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m1, m10 + movu xm3, [r0 + r4 + 32] ; m3 = row 3 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m4, m2, m11 + paddw m0, m4 + pmaddubsw m2, m10 + lea r0, [r0 + r1 * 4] + movu xm4, [r0 + 32] ; m4 = row 4 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m5, m3, m11 + paddw m1, m5 + pmaddubsw m3, m10 + movu xm5, [r0 + r1 + 32] ; m5 = row 5 + punpckhbw xm6, xm4, xm5 + punpcklbw xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddubsw m4, 
m11 + paddw m2, m4 + movu xm6, [r0 + r1 * 2 + 32] ; m6 = row 6 + punpckhbw xm7, xm5, xm6 + punpcklbw xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddubsw m5, m11 + paddw m3, m5 +%ifidn %1,pp + pmulhrsw m0, m12 ; m0 = word: row 0 + pmulhrsw m1, m12 ; m1 = word: row 1 + pmulhrsw m2, m12 ; m2 = word: row 2 + pmulhrsw m3, m12 ; m3 = word: row 3 + packuswb m0, m1 + packuswb m2, m3 + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r5], xm3 + lea r2, [r2 + r3 * 4 - 32] +%else + psubw m0, m12 ; m0 = word: row 0 + psubw m1, m12 ; m1 = word: row 1 + psubw m2, m12 ; m2 = word: row 2 + psubw m3, m12 ; m3 = word: row 3 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r5], m3 + lea r2, [r2 + r3 * 4 - 64] +%endif + dec r6d + jnz .loopH + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_48x64 pp + FILTER_VER_CHROMA_AVX2_48x64 ps + +%macro FILTER_VER_CHROMA_AVX2_64xN 2 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_64x%2, 4, 8, 13 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m10, [r5] + mova m11, [r5 + mmsize] + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m12, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m12, [pw_2000] +%endif + lea r5, [r3 * 3] + lea r7, [r1 * 4] + mov r6d, %2 / 4 +.loopH: +%assign x 0 +%rep 2 + movu m0, [r0 + x] ; m0 = row 0 + movu m1, [r0 + r1 + x] ; m1 = row 1 + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + movu m0, [r0 + r1 * 2 + x] ; m0 = row 2 + punpcklbw m4, m1, m0 + punpckhbw m5, m1, m0 + pmaddubsw m4, m10 + pmaddubsw m5, m10 + movu m1, [r0 + r4 + x] ; m1 = row 3 + punpcklbw m6, m0, m1 + punpckhbw m7, m0, m1 + pmaddubsw m8, m6, m11 + pmaddubsw m9, m7, m11 + pmaddubsw m6, m10 + pmaddubsw m7, m10 + paddw m2, m8 + paddw m3, m9 
+%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2], m2 +%else + psubw m2, m12 + psubw m3, m12 + vperm2i128 m0, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + movu [r2], m0 + movu [r2 + mmsize], m2 +%endif + lea r0, [r0 + r1 * 4] + movu m0, [r0 + x] ; m0 = row 4 + punpcklbw m2, m1, m0 + punpckhbw m3, m1, m0 + pmaddubsw m8, m2, m11 + pmaddubsw m9, m3, m11 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + paddw m4, m8 + paddw m5, m9 +%ifidn %1,pp + pmulhrsw m4, m12 + pmulhrsw m5, m12 + packuswb m4, m5 + movu [r2 + r3], m4 +%else + psubw m4, m12 + psubw m5, m12 + vperm2i128 m1, m4, m5, 0x20 + vperm2i128 m4, m4, m5, 0x31 + movu [r2 + r3], m1 + movu [r2 + r3 + mmsize], m4 +%endif + + movu m1, [r0 + r1 + x] ; m1 = row 5 + punpcklbw m4, m0, m1 + punpckhbw m5, m0, m1 + pmaddubsw m4, m11 + pmaddubsw m5, m11 + paddw m6, m4 + paddw m7, m5 +%ifidn %1,pp + pmulhrsw m6, m12 + pmulhrsw m7, m12 + packuswb m6, m7 + movu [r2 + r3 * 2], m6 +%else + psubw m6, m12 + psubw m7, m12 + vperm2i128 m0, m6, m7, 0x20 + vperm2i128 m6, m6, m7, 0x31 + movu [r2 + r3 * 2], m0 + movu [r2 + r3 * 2 + mmsize], m6 +%endif + + movu m0, [r0 + r1 * 2 + x] ; m0 = row 6 + punpcklbw m6, m1, m0 + punpckhbw m7, m1, m0 + pmaddubsw m6, m11 + pmaddubsw m7, m11 + paddw m2, m6 + paddw m3, m7 +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2 + r5], m2 + add r2, 32 +%else + psubw m2, m12 + psubw m3, m12 + vperm2i128 m0, m2, m3, 0x20 + vperm2i128 m2, m2, m3, 0x31 + movu [r2 + r5], m0 + movu [r2 + r5 + mmsize], m2 + add r2, 64 +%endif + sub r0, r7 +%assign x x+32 +%endrep +%ifidn %1,pp + lea r2, [r2 + r3 * 4 - 64] +%else + lea r2, [r2 + r3 * 4 - 128] +%endif + add r0, r7 + dec r6d + jnz .loopH + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_64xN pp, 64 + FILTER_VER_CHROMA_AVX2_64xN pp, 48 + FILTER_VER_CHROMA_AVX2_64xN pp, 32 + FILTER_VER_CHROMA_AVX2_64xN pp, 16 + FILTER_VER_CHROMA_AVX2_64xN ps, 64 + FILTER_VER_CHROMA_AVX2_64xN ps, 48 + FILTER_VER_CHROMA_AVX2_64xN 
ps, 32 + FILTER_VER_CHROMA_AVX2_64xN ps, 16 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W16n_H2 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8 + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + + mov r4d, %2/2 + +.loop: + + mov r6d, %1/16 + +.loopW: + + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r5, [r0 + 2 * r1] + movu m5, [r5] + movu m6, [r5 + r1] + + punpckhbw m7, m5, m6 + pmaddubsw m7, m0 + paddw m2, m7 + + punpcklbw m7, m5, m6 + pmaddubsw m7, m0 + paddw m4, m7 + + mova m7, [pw_512] + + pmulhrsw m4, m7 + pmulhrsw m2, m7 + + packuswb m4, m2 + + movu [r2], m4 + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m4, m1 + pmaddubsw m3, m1 + + movu m5, [r5 + 2 * r1] + + punpcklbw m2, m6, m5 + punpckhbw m6, m5 + + pmaddubsw m2, m0 + pmaddubsw m6, m0 + + paddw m4, m2 + paddw m3, m6 + + pmulhrsw m4, m7 + pmulhrsw m3, m7 + + packuswb m4, m3 + + movu [r2 + r3], m4 + + add r0, 16 + add r2, 16 + dec r6d + jnz .loopW + + lea r0, [r0 + r1 * 2 - %1] + lea r2, [r2 + r3 * 2 - %1] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V4_W16n_H2 64, 64 + FILTER_V4_W16n_H2 64, 32 + FILTER_V4_W16n_H2 64, 48 + FILTER_V4_W16n_H2 48, 64 + FILTER_V4_W16n_H2 64, 16 + +;----------------------------------------------------------------------------- ; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) ;----------------------------------------------------------------------------- %macro P2S_H_2xN 1 @@ -1145,7 +12374,7 @@ mova m4, [pb_128] mova m5, 
[tab_c_64_n64] -.loop: +.loop movh m0, [r0] punpcklbw m0, m4 pmaddubsw m0, m5 @@ -6791,6 +18020,4888 @@ RET +%macro PROCESS_CHROMA_SP_W4_4R 0 + movq m0, [r0] + movq m1, [r0 + r1] + punpcklwd m0, m1 ;m0=[0 1] + pmaddwd m0, [r6 + 0 *16] ;m0=[0+1] Row1 + + lea r0, [r0 + 2 * r1] + movq m4, [r0] + punpcklwd m1, m4 ;m1=[1 2] + pmaddwd m1, [r6 + 0 *16] ;m1=[1+2] Row2 + + movq m5, [r0 + r1] + punpcklwd m4, m5 ;m4=[2 3] + pmaddwd m2, m4, [r6 + 0 *16] ;m2=[2+3] Row3 + pmaddwd m4, [r6 + 1 * 16] + paddd m0, m4 ;m0=[0+1+2+3] Row1 done + + lea r0, [r0 + 2 * r1] + movq m4, [r0] + punpcklwd m5, m4 ;m5=[3 4] + pmaddwd m3, m5, [r6 + 0 *16] ;m3=[3+4] Row4 + pmaddwd m5, [r6 + 1 * 16] + paddd m1, m5 ;m1 = [1+2+3+4] Row2 + + movq m5, [r0 + r1] + punpcklwd m4, m5 ;m4=[4 5] + pmaddwd m4, [r6 + 1 * 16] + paddd m2, m4 ;m2=[2+3+4+5] Row3 + + movq m4, [r0 + 2 * r1] + punpcklwd m5, m4 ;m5=[5 6] + pmaddwd m5, [r6 + 1 * 16] + paddd m3, m5 ;m3=[3+4+5+6] Row4 +%endmacro + +;-------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_sp_%1x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SP 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_sp_%1x%2, 5, 7, 7 ,0-gprsize + + add r1d, r1d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r6, [r5 + r4] +%else + lea r6, [tab_ChromaCoeffV + r4] +%endif + + mova m6, [pd_526336] + + mov dword [rsp], %2/4 + +.loopH: + mov r4d, (%1/4) +.loopW: + PROCESS_CHROMA_SP_W4_4R + + paddd m0, m6 + paddd m1, m6 + paddd m2, m6 + paddd m3, m6 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + + packuswb m0, m2 + + movd [r2], m0 + pextrd [r2 + r3], m0, 1 + lea r5, [r2 + 2 * r3] + pextrd [r5], m0, 2 + pextrd [r5 + r3], m0, 3 + + lea r5, [4 * r1 - 2 * 4] + sub 
r0, r5 + add r2, 4 + + dec r4d + jnz .loopW + + lea r0, [r0 + 4 * r1 - 2 * %1] + lea r2, [r2 + 4 * r3 - %1] + + dec dword [rsp] + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SP 4, 4 + FILTER_VER_CHROMA_SP 4, 8 + FILTER_VER_CHROMA_SP 16, 16 + FILTER_VER_CHROMA_SP 16, 8 + FILTER_VER_CHROMA_SP 16, 12 + FILTER_VER_CHROMA_SP 12, 16 + FILTER_VER_CHROMA_SP 16, 4 + FILTER_VER_CHROMA_SP 4, 16 + FILTER_VER_CHROMA_SP 32, 32 + FILTER_VER_CHROMA_SP 32, 16 + FILTER_VER_CHROMA_SP 16, 32 + FILTER_VER_CHROMA_SP 32, 24 + FILTER_VER_CHROMA_SP 24, 32 + FILTER_VER_CHROMA_SP 32, 8 + + FILTER_VER_CHROMA_SP 16, 24 + FILTER_VER_CHROMA_SP 16, 64 + FILTER_VER_CHROMA_SP 12, 32 + FILTER_VER_CHROMA_SP 4, 32 + FILTER_VER_CHROMA_SP 32, 64 + FILTER_VER_CHROMA_SP 32, 48 + FILTER_VER_CHROMA_SP 24, 64 + + FILTER_VER_CHROMA_SP 64, 64 + FILTER_VER_CHROMA_SP 64, 32 + FILTER_VER_CHROMA_SP 64, 48 + FILTER_VER_CHROMA_SP 48, 64 + FILTER_VER_CHROMA_SP 64, 16 + + +%macro PROCESS_CHROMA_SP_W2_4R 1 + movd m0, [r0] + movd m1, [r0 + r1] + punpcklwd m0, m1 ;m0=[0 1] + + lea r0, [r0 + 2 * r1] + movd m2, [r0] + punpcklwd m1, m2 ;m1=[1 2] + punpcklqdq m0, m1 ;m0=[0 1 1 2] + pmaddwd m0, [%1 + 0 *16] ;m0=[0+1 1+2] Row 1-2 + + movd m1, [r0 + r1] + punpcklwd m2, m1 ;m2=[2 3] + + lea r0, [r0 + 2 * r1] + movd m3, [r0] + punpcklwd m1, m3 ;m2=[3 4] + punpcklqdq m2, m1 ;m2=[2 3 3 4] + + pmaddwd m4, m2, [%1 + 1 * 16] ;m4=[2+3 3+4] Row 1-2 + pmaddwd m2, [%1 + 0 * 16] ;m2=[2+3 3+4] Row 3-4 + paddd m0, m4 ;m0=[0+1+2+3 1+2+3+4] Row 1-2 + + movd m1, [r0 + r1] + punpcklwd m3, m1 ;m3=[4 5] + + movd m4, [r0 + 2 * r1] + punpcklwd m1, m4 ;m1=[5 6] + punpcklqdq m3, m1 ;m2=[4 5 5 6] + pmaddwd m3, [%1 + 1 * 16] ;m3=[4+5 5+6] Row 3-4 + paddd m2, m3 ;m2=[2+3+4+5 3+4+5+6] Row 3-4 +%endmacro + +;------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vertical_sp_%1x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
+;------------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SP_W2_4R 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_sp_%1x%2, 5, 6, 6 + + add r1d, r1d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mova m5, [pd_526336] + + mov r4d, (%2/4) + +.loopH: + PROCESS_CHROMA_SP_W2_4R r5 + + paddd m0, m5 + paddd m2, m5 + + psrad m0, 12 + psrad m2, 12 + + packssdw m0, m2 + packuswb m0, m0 + + pextrw [r2], m0, 0 + pextrw [r2 + r3], m0, 1 + lea r2, [r2 + 2 * r3] + pextrw [r2], m0, 2 + pextrw [r2 + r3], m0, 3 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SP_W2_4R 2, 4 + FILTER_VER_CHROMA_SP_W2_4R 2, 8 + + FILTER_VER_CHROMA_SP_W2_4R 2, 16 + +;-------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_sp_4x2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal interp_4tap_vert_sp_4x2, 5, 6, 5 + + add r1d, r1d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mova m4, [pd_526336] + + movq m0, [r0] + movq m1, [r0 + r1] + punpcklwd m0, m1 ;m0=[0 1] + pmaddwd m0, [r5 + 0 *16] ;m0=[0+1] Row1 + + lea r0, [r0 + 2 * r1] + movq m2, [r0] + punpcklwd m1, m2 ;m1=[1 2] + pmaddwd m1, [r5 + 0 *16] ;m1=[1+2] Row2 + + movq m3, [r0 + r1] + punpcklwd m2, m3 ;m4=[2 3] + pmaddwd m2, [r5 + 1 * 16] + paddd m0, m2 ;m0=[0+1+2+3] Row1 done + paddd m0, m4 + psrad m0, 12 + + movq m2, [r0 + 2 * r1] + punpcklwd m3, m2 ;m5=[3 4] + pmaddwd m3, [r5 + 1 * 16] + paddd m1, m3 ;m1 = [1+2+3+4] Row2 done + paddd m1, m4 + psrad m1, 12 + + packssdw m0, m1 + packuswb m0, m0 + + 
movd [r2], m0 + pextrd [r2 + r3], m0, 1 + + RET + +;------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vertical_sp_6x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SP_W6_H4 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_sp_6x%2, 5, 7, 7 + + add r1d, r1d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r6, [r5 + r4] +%else + lea r6, [tab_ChromaCoeffV + r4] +%endif + + mova m6, [pd_526336] + + mov r4d, %2/4 + +.loopH: + PROCESS_CHROMA_SP_W4_4R + + paddd m0, m6 + paddd m1, m6 + paddd m2, m6 + paddd m3, m6 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + + packuswb m0, m2 + + movd [r2], m0 + pextrd [r2 + r3], m0, 1 + lea r5, [r2 + 2 * r3] + pextrd [r5], m0, 2 + pextrd [r5 + r3], m0, 3 + + lea r5, [4 * r1 - 2 * 4] + sub r0, r5 + add r2, 4 + + PROCESS_CHROMA_SP_W2_4R r6 + + paddd m0, m6 + paddd m2, m6 + + psrad m0, 12 + psrad m2, 12 + + packssdw m0, m2 + packuswb m0, m0 + + pextrw [r2], m0, 0 + pextrw [r2 + r3], m0, 1 + lea r2, [r2 + 2 * r3] + pextrw [r2], m0, 2 + pextrw [r2 + r3], m0, 3 + + sub r0, 2 * 4 + lea r2, [r2 + 2 * r3 - 4] + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SP_W6_H4 6, 8 + + FILTER_VER_CHROMA_SP_W6_H4 6, 16 + +%macro PROCESS_CHROMA_SP_W8_2R 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5 + 0 * 16] ;m0 = [0l+1l] Row1l + punpckhwd m1, m3 + pmaddwd m1, [r5 + 0 * 16] ;m1 = [0h+1h] Row1h + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5 + 0 * 16] ;m2 = [1l+2l] Row2l + punpckhwd m3, m4 + pmaddwd m3, [r5 + 0 * 16] ;m3 = [1h+2h] Row2h + + lea r0, [r0 + 2 * r1] + movu m5, [r0 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * 16] ;m6 = [2l+3l] Row1l + 
paddd m0, m6 ;m0 = [0l+1l+2l+3l] Row1l sum + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * 16] ;m6 = [2h+3h] Row1h + paddd m1, m4 ;m1 = [0h+1h+2h+3h] Row1h sum + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * 16] ;m6 = [3l+4l] Row2l + paddd m2, m6 ;m2 = [1l+2l+3l+4l] Row2l sum + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * 16] ;m1 = [3h+4h] Row2h + paddd m3, m5 ;m3 = [1h+2h+3h+4h] Row2h sum +%endmacro + +;-------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_sp_8x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SP_W8_H2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_sp_%1x%2, 5, 6, 8 + + add r1d, r1d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mova m7, [pd_526336] + + mov r4d, %2/2 +.loopH: + PROCESS_CHROMA_SP_W8_2R + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + + packuswb m0, m2 + + movlps [r2], m0 + movhps [r2 + r3], m0 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SP_W8_H2 8, 2 + FILTER_VER_CHROMA_SP_W8_H2 8, 4 + FILTER_VER_CHROMA_SP_W8_H2 8, 6 + FILTER_VER_CHROMA_SP_W8_H2 8, 8 + FILTER_VER_CHROMA_SP_W8_H2 8, 16 + FILTER_VER_CHROMA_SP_W8_H2 8, 32 + + FILTER_VER_CHROMA_SP_W8_H2 8, 12 + FILTER_VER_CHROMA_SP_W8_H2 8, 64 + + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_2x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) 
+;----------------------------------------------------------------------------------------------------------------------------- +%macro FILTER_HORIZ_CHROMA_2xN 2 +INIT_XMM sse4 +cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 4, src, srcstride, dst, dststride +%define coef2 m3 +%define Tm0 m2 +%define t1 m1 +%define t0 m0 + + dec srcq + mov r4d, r4m + add dststrided, dststrided + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + movd coef2, [r6 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t1, [pw_2000] + mova Tm0, [tab_Tm] + + mov r4d, %2 + cmp r5m, byte 0 + je .loopH + sub srcq, srcstrideq + add r4d, 3 + +.loopH: + movh t0, [srcq] + pshufb t0, t0, Tm0 + pmaddubsw t0, coef2 + phaddw t0, t0 + psubw t0, t1 + movd [dstq], t0 + + lea srcq, [srcq + srcstrideq] + lea dstq, [dstq + dststrideq] + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_HORIZ_CHROMA_2xN 2, 4 + FILTER_HORIZ_CHROMA_2xN 2, 8 + + FILTER_HORIZ_CHROMA_2xN 2, 16 + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_4x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +%macro FILTER_HORIZ_CHROMA_4xN 2 +INIT_XMM sse4 +cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 4, src, srcstride, dst, dststride +%define coef2 m3 +%define Tm0 m2 +%define t1 m1 +%define t0 m0 + + dec srcq + mov r4d, r4m + add dststrided, dststrided + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + movd coef2, [r6 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t1, [pw_2000] + mova Tm0, [tab_Tm] + + mov r4d, %2 + cmp r5m, byte 0 + je .loopH + sub srcq, srcstrideq + add r4d, 3 + +.loopH: + movh t0, [srcq] + pshufb t0, t0, Tm0 + pmaddubsw t0, coef2 + phaddw t0, t0 + psubw t0, t1 + 
movlps [dstq], t0 + + lea srcq, [srcq + srcstrideq] + lea dstq, [dstq + dststrideq] + + dec r4d + jnz .loopH + RET +%endmacro + + FILTER_HORIZ_CHROMA_4xN 4, 2 + FILTER_HORIZ_CHROMA_4xN 4, 4 + FILTER_HORIZ_CHROMA_4xN 4, 8 + FILTER_HORIZ_CHROMA_4xN 4, 16 + + FILTER_HORIZ_CHROMA_4xN 4, 32 + +%macro PROCESS_CHROMA_W6 3 + movu %1, [srcq] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + psubw %2, %3 + movh [dstq], %2 + pshufd %2, %2, 2 + movd [dstq + 8], %2 +%endmacro + +%macro PROCESS_CHROMA_W12 3 + movu %1, [srcq] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + psubw %2, %3 + movu [dstq], %2 + movu %1, [srcq + 8] + pshufb %1, %1, Tm0 + pmaddubsw %1, coef2 + phaddw %1, %1 + psubw %1, %3 + movh [dstq + 16], %1 +%endmacro + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +%macro FILTER_HORIZ_CHROMA 2 +INIT_XMM sse4 +cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride +%define coef2 m5 +%define Tm0 m4 +%define Tm1 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + dec srcq + mov r4d, r4m + add dststrided, dststrided + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + movd coef2, [r6 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_2000] + mova Tm0, [tab_Tm] + mova Tm1, [tab_Tm + 16] + + mov r4d, %2 + cmp r5m, byte 0 + je .loopH + sub srcq, srcstrideq + add r4d, 3 + +.loopH: + PROCESS_CHROMA_W%1 t0, t1, t2 + add srcq, srcstrideq + add dstq, dststrideq + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_HORIZ_CHROMA 6, 8 + FILTER_HORIZ_CHROMA 12, 16 + +
FILTER_HORIZ_CHROMA 6, 16 + FILTER_HORIZ_CHROMA 12, 32 + +%macro PROCESS_CHROMA_W8 3 + movu %1, [srcq] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + psubw %2, %3 + movu [dstq], %2 +%endmacro + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +%macro FILTER_HORIZ_CHROMA_8xN 2 +INIT_XMM sse4 +cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride +%define coef2 m5 +%define Tm0 m4 +%define Tm1 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + dec srcq + mov r4d, r4m + add dststrided, dststrided + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + movd coef2, [r6 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_2000] + mova Tm0, [tab_Tm] + mova Tm1, [tab_Tm + 16] + + mov r4d, %2 + cmp r5m, byte 0 + je .loopH + sub srcq, srcstrideq + add r4d, 3 + +.loopH: + PROCESS_CHROMA_W8 t0, t1, t2 + add srcq, srcstrideq + add dstq, dststrideq + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_HORIZ_CHROMA_8xN 8, 2 + FILTER_HORIZ_CHROMA_8xN 8, 4 + FILTER_HORIZ_CHROMA_8xN 8, 6 + FILTER_HORIZ_CHROMA_8xN 8, 8 + FILTER_HORIZ_CHROMA_8xN 8, 16 + FILTER_HORIZ_CHROMA_8xN 8, 32 + + FILTER_HORIZ_CHROMA_8xN 8, 12 + FILTER_HORIZ_CHROMA_8xN 8, 64 + +%macro PROCESS_CHROMA_W16 4 + movu %1, [srcq] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq + 8] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + psubw %2, %3 + psubw %4, %3 + movu [dstq], %2 + movu [dstq + 16], %4 +%endmacro + +%macro PROCESS_CHROMA_W24 4 + movu 
%1, [srcq] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq + 8] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + psubw %2, %3 + psubw %4, %3 + movu [dstq], %2 + movu [dstq + 16], %4 + movu %1, [srcq + 16] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + psubw %2, %3 + movu [dstq + 32], %2 +%endmacro + +%macro PROCESS_CHROMA_W32 4 + movu %1, [srcq] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq + 8] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + psubw %2, %3 + psubw %4, %3 + movu [dstq], %2 + movu [dstq + 16], %4 + movu %1, [srcq + 16] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq + 24] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + psubw %2, %3 + psubw %4, %3 + movu [dstq + 32], %2 + movu [dstq + 48], %4 +%endmacro + +%macro PROCESS_CHROMA_W16o 5 + movu %1, [srcq + %5] + pshufb %2, %1, Tm0 + pmaddubsw %2, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %2, %1 + movu %1, [srcq + %5 + 8] + pshufb %4, %1, Tm0 + pmaddubsw %4, coef2 + pshufb %1, %1, Tm1 + pmaddubsw %1, coef2 + phaddw %4, %1 + psubw %2, %3 + psubw %4, %3 + movu [dstq + %5 * 2], %2 + movu [dstq + %5 * 2 + 16], %4 +%endmacro + +%macro PROCESS_CHROMA_W48 4 + PROCESS_CHROMA_W16o %1, %2, %3, %4, 0 + PROCESS_CHROMA_W16o %1, %2, %3, %4, 16 + PROCESS_CHROMA_W16o %1, %2, %3, %4, 32 +%endmacro + +%macro PROCESS_CHROMA_W64 4 + PROCESS_CHROMA_W16o %1, %2, %3, %4, 0 + PROCESS_CHROMA_W16o %1, %2, %3, %4, 16 + PROCESS_CHROMA_W16o %1, %2, %3, %4, 32 + PROCESS_CHROMA_W16o %1, %2, %3, %4, 48 +%endmacro + 
+;------------------------------------------------------------------------------------------------------------------------------ +; void interp_4tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;------------------------------------------------------------------------------------------------------------------------------ +%macro FILTER_HORIZ_CHROMA_WxN 2 +INIT_XMM sse4 +cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 7, src, srcstride, dst, dststride +%define coef2 m6 +%define Tm0 m5 +%define Tm1 m4 +%define t3 m3 +%define t2 m2 +%define t1 m1 +%define t0 m0 + + dec srcq + mov r4d, r4m + add dststrided, dststrided + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + movd coef2, [r6 + r4 * 4] +%else + movd coef2, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufd coef2, coef2, 0 + mova t2, [pw_2000] + mova Tm0, [tab_Tm] + mova Tm1, [tab_Tm + 16] + + mov r4d, %2 + cmp r5m, byte 0 + je .loopH + sub srcq, srcstrideq + add r4d, 3 + +.loopH: + PROCESS_CHROMA_W%1 t0, t1, t2, t3 + add srcq, srcstrideq + add dstq, dststrideq + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_HORIZ_CHROMA_WxN 16, 4 + FILTER_HORIZ_CHROMA_WxN 16, 8 + FILTER_HORIZ_CHROMA_WxN 16, 12 + FILTER_HORIZ_CHROMA_WxN 16, 16 + FILTER_HORIZ_CHROMA_WxN 16, 32 + FILTER_HORIZ_CHROMA_WxN 24, 32 + FILTER_HORIZ_CHROMA_WxN 32, 8 + FILTER_HORIZ_CHROMA_WxN 32, 16 + FILTER_HORIZ_CHROMA_WxN 32, 24 + FILTER_HORIZ_CHROMA_WxN 32, 32 + + FILTER_HORIZ_CHROMA_WxN 16, 24 + FILTER_HORIZ_CHROMA_WxN 16, 64 + FILTER_HORIZ_CHROMA_WxN 24, 64 + FILTER_HORIZ_CHROMA_WxN 32, 48 + FILTER_HORIZ_CHROMA_WxN 32, 64 + + FILTER_HORIZ_CHROMA_WxN 64, 64 + FILTER_HORIZ_CHROMA_WxN 64, 32 + FILTER_HORIZ_CHROMA_WxN 64, 48 + FILTER_HORIZ_CHROMA_WxN 48, 64 + FILTER_HORIZ_CHROMA_WxN 64, 16 + + +;--------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t 
dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------- +%macro FILTER_V_PS_W16n 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_%1x%2, 4, 7, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] + mov r4d, %2/2 + +.loop: + + mov r6d, %1/16 + +.loopW: + + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r5, [r0 + 2 * r1] + movu m5, [r5] + movu m7, [r5 + r1] + + punpcklbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m4, m6 + + punpckhbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m2, m6 + + mova m6, [pw_2000] + + psubw m4, m6 + psubw m2, m6 + + movu [r2], m4 + movu [r2 + 16], m2 + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + pmaddubsw m4, m1 + pmaddubsw m3, m1 + + movu m5, [r5 + 2 * r1] + + punpcklbw m2, m7, m5 + punpckhbw m7, m5 + + pmaddubsw m2, m0 + pmaddubsw m7, m0 + + paddw m4, m2 + paddw m3, m7 + + psubw m4, m6 + psubw m3, m6 + + movu [r2 + r3], m4 + movu [r2 + r3 + 16], m3 + + add r0, 16 + add r2, 32 + dec r6d + jnz .loopW + + lea r0, [r0 + r1 * 2 - %1] + lea r2, [r2 + r3 * 2 - %1 * 2] + + dec r4d + jnz .loop + RET +%endmacro + + FILTER_V_PS_W16n 64, 64 + FILTER_V_PS_W16n 64, 32 + FILTER_V_PS_W16n 64, 48 + FILTER_V_PS_W16n 48, 64 + FILTER_V_PS_W16n 64, 16 + + +;------------------------------------------------------------------------------------------------------------ +;void interp_4tap_vert_ps_2x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------ +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_2x4, 4, 6, 7 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 
4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + + lea r5, [3 * r1] + + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 + + pmaddubsw m2, m0 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + punpcklbw m3, m4 + punpcklbw m1, m5, m6 + punpcklbw m3, m1 + + pmaddubsw m3, m0 + phaddw m2, m3 + + mova m1, [pw_2000] + + psubw m2, m1 + + movd [r2], m2 + pextrd [r2 + r3], m2, 2 + + movd m2, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m2 + punpcklbw m4, m3 + + pmaddubsw m4, m0 + + movd m3, [r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m2, m3 + punpcklbw m5, m2 + + pmaddubsw m5, m0 + phaddw m4, m5 + psubw m4, m1 + + lea r2, [r2 + 2 * r3] + movd [r2], m4 + pextrd [r2 + r3], m4, 2 + + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ps_2x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_V_PS_W2 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ps_2x%2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] +%else + movd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + pshufb m0, [tab_Cm] + + mova m1, [pw_2000] + lea r5, [3 * r1] + mov r4d, %2/4 +.loop: + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 + + pmaddubsw m2, m0 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 + + pmaddubsw m3, m0 + + phaddw m2, m3 + psubw m2, m1 + + + movd [r2], m2 + pshufd m2, m2, 2 + movd [r2 + r3], m2 + + movd m2, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m2 + punpcklbw m4, m3 + + pmaddubsw m4, m0 + + movd m3,
[r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m2, m3 + punpcklbw m5, m2 + + pmaddubsw m5, m0 + + phaddw m4, m5 + + psubw m4, m1 + + lea r2, [r2 + 2 * r3] + movd [r2], m4 + pshufd m4, m4, 2 + movd [r2 + r3], m4 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loop + + RET +%endmacro + + FILTER_V_PS_W2 2, 8 + + FILTER_V_PS_W2 2, 16 + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SS 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_ss_%1x%2, 5, 7, 6, 0-gprsize + + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r6, [r5 + r4] +%else + lea r6, [tab_ChromaCoeffV + r4] +%endif + + mov dword [rsp], %2/4 + +.loopH: + mov r4d, (%1/4) +.loopW: + PROCESS_CHROMA_SP_W4_4R + + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + movlps [r2], m0 + movhps [r2 + r3], m0 + lea r5, [r2 + 2 * r3] + movlps [r5], m2 + movhps [r5 + r3], m2 + + lea r5, [4 * r1 - 2 * 4] + sub r0, r5 + add r2, 2 * 4 + + dec r4d + jnz .loopW + + lea r0, [r0 + 4 * r1 - 2 * %1] + lea r2, [r2 + 4 * r3 - 2 * %1] + + dec dword [rsp] + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SS 4, 4 + FILTER_VER_CHROMA_SS 4, 8 + FILTER_VER_CHROMA_SS 16, 16 + FILTER_VER_CHROMA_SS 16, 8 + FILTER_VER_CHROMA_SS 16, 12 + FILTER_VER_CHROMA_SS 12, 16 + FILTER_VER_CHROMA_SS 16, 4 + FILTER_VER_CHROMA_SS 4, 16 + FILTER_VER_CHROMA_SS 32, 32 + FILTER_VER_CHROMA_SS 32, 16 + FILTER_VER_CHROMA_SS 16, 32 + FILTER_VER_CHROMA_SS 32, 24 + FILTER_VER_CHROMA_SS 24, 32 + FILTER_VER_CHROMA_SS 32, 8 + + FILTER_VER_CHROMA_SS 16, 24 + FILTER_VER_CHROMA_SS 12, 32 + FILTER_VER_CHROMA_SS 4, 32 + FILTER_VER_CHROMA_SS 32, 64 +
FILTER_VER_CHROMA_SS 16, 64 + FILTER_VER_CHROMA_SS 32, 48 + FILTER_VER_CHROMA_SS 24, 64 + + FILTER_VER_CHROMA_SS 64, 64 + FILTER_VER_CHROMA_SS 64, 32 + FILTER_VER_CHROMA_SS 64, 48 + FILTER_VER_CHROMA_SS 48, 64 + FILTER_VER_CHROMA_SS 64, 16 + +%macro FILTER_VER_CHROMA_S_AVX2_4x4 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x4, 4, 6, 7 + mov r4d, r4m + add r1d, r1d + shl r4d, 6 + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m6, [pd_526336] +%else + add r3d, r3d +%endif + + movq xm0, [r0] + movq xm1, [r0 + r1] + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + movq xm3, [r0 + r4] + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + movq xm3, [r0 + r1] + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m4, [r5 + 1 * mmsize] + paddd m2, m4 + +%ifidn %1,sp + paddd m0, m6 + paddd m2, m6 + psrad m0, 12 + psrad m2, 12 +%else + psrad m0, 6 + psrad m2, 6 +%endif + packssdw m0, m2 + vextracti128 xm2, m0, 1 + lea r4, [r3 * 3] + +%ifidn %1,sp + packuswb xm0, xm2 + movd [r2], xm0 + pextrd [r2 + r3], xm0, 2 + pextrd [r2 + r3 * 2], xm0, 1 + pextrd [r2 + r4], xm0, 3 +%else + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r4], xm2 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_4x4 sp + FILTER_VER_CHROMA_S_AVX2_4x4 ss + +%macro FILTER_VER_CHROMA_S_AVX2_4x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x8, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m7, [pd_526336] +%else + 
add r3d, r3d +%endif + + movq xm0, [r0] + movq xm1, [r0 + r1] + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + movq xm3, [r0 + r4] + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + movq xm3, [r0 + r1] + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m5, m4, [r5 + 1 * mmsize] + paddd m2, m5 + pmaddwd m4, [r5] + movq xm3, [r0 + r4] + punpcklwd xm1, xm3 + lea r0, [r0 + 4 * r1] + movq xm6, [r0] + punpcklwd xm3, xm6 + vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] + pmaddwd m5, m1, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m1, [r5] + movq xm3, [r0 + r1] + punpcklwd xm6, xm3 + movq xm5, [r0 + 2 * r1] + punpcklwd xm3, xm5 + vinserti128 m6, m6, xm3, 1 ; m6 = [A 9 9 8] + pmaddwd m6, [r5 + 1 * mmsize] + paddd m1, m6 + lea r4, [r3 * 3] + +%ifidn %1,sp + paddd m0, m7 + paddd m2, m7 + paddd m4, m7 + paddd m1, m7 + psrad m0, 12 + psrad m2, 12 + psrad m4, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m2, 6 + psrad m4, 6 + psrad m1, 6 +%endif + packssdw m0, m2 + packssdw m4, m1 +%ifidn %1,sp + packuswb m0, m4 + vextracti128 xm2, m0, 1 + movd [r2], xm0 + movd [r2 + r3], xm2 + pextrd [r2 + r3 * 2], xm0, 1 + pextrd [r2 + r4], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm0, 2 + pextrd [r2 + r3], xm2, 2 + pextrd [r2 + r3 * 2], xm0, 3 + pextrd [r2 + r4], xm2, 3 +%else + vextracti128 xm2, m0, 1 + vextracti128 xm1, m4, 1 + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r4], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm4 + movq [r2 + r3], xm1 + movhps [r2 + r3 * 2], xm4 + movhps [r2 + r4], xm1 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_4x8 sp + FILTER_VER_CHROMA_S_AVX2_4x8 ss + +%macro PROCESS_CHROMA_AVX2_W4_16R 1 + movq xm0, [r0] + movq 
xm1, [r0 + r1] + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + movq xm3, [r0 + r4] + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + movq xm3, [r0 + r1] + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m5, m4, [r5 + 1 * mmsize] + paddd m2, m5 + pmaddwd m4, [r5] + movq xm3, [r0 + r4] + punpcklwd xm1, xm3 + lea r0, [r0 + 4 * r1] + movq xm6, [r0] + punpcklwd xm3, xm6 + vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] + pmaddwd m5, m1, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m1, [r5] + movq xm3, [r0 + r1] + punpcklwd xm6, xm3 + movq xm5, [r0 + 2 * r1] + punpcklwd xm3, xm5 + vinserti128 m6, m6, xm3, 1 ; m6 = [10 9 9 8] + pmaddwd m3, m6, [r5 + 1 * mmsize] + paddd m1, m3 + pmaddwd m6, [r5] + +%ifidn %1,sp + paddd m0, m7 + paddd m2, m7 + paddd m4, m7 + paddd m1, m7 + psrad m4, 12 + psrad m1, 12 + psrad m0, 12 + psrad m2, 12 +%else + psrad m0, 6 + psrad m2, 6 + psrad m4, 6 + psrad m1, 6 +%endif + packssdw m0, m2 + packssdw m4, m1 +%ifidn %1,sp + packuswb m0, m4 + vextracti128 xm4, m0, 1 + movd [r2], xm0 + movd [r2 + r3], xm4 + pextrd [r2 + r3 * 2], xm0, 1 + pextrd [r2 + r6], xm4, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm0, 2 + pextrd [r2 + r3], xm4, 2 + pextrd [r2 + r3 * 2], xm0, 3 + pextrd [r2 + r6], xm4, 3 +%else + vextracti128 xm2, m0, 1 + vextracti128 xm1, m4, 1 + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm4 + movq [r2 + r3], xm1 + movhps [r2 + r3 * 2], xm4 + movhps [r2 + r6], xm1 +%endif + + movq xm2, [r0 + r4] + punpcklwd xm5, xm2 + lea r0, [r0 + 4 * r1] + movq xm0, [r0] + punpcklwd xm2, xm0 + vinserti128 m5, m5, xm2, 1 ; m5 = [12 11 11 10] + pmaddwd m2, m5, [r5 + 1 * mmsize] + 
paddd m6, m2 + pmaddwd m5, [r5] + movq xm2, [r0 + r1] + punpcklwd xm0, xm2 + movq xm3, [r0 + 2 * r1] + punpcklwd xm2, xm3 + vinserti128 m0, m0, xm2, 1 ; m0 = [14 13 13 12] + pmaddwd m2, m0, [r5 + 1 * mmsize] + paddd m5, m2 + pmaddwd m0, [r5] + movq xm4, [r0 + r4] + punpcklwd xm3, xm4 + lea r0, [r0 + 4 * r1] + movq xm1, [r0] + punpcklwd xm4, xm1 + vinserti128 m3, m3, xm4, 1 ; m3 = [16 15 15 14] + pmaddwd m4, m3, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m3, [r5] + movq xm4, [r0 + r1] + punpcklwd xm1, xm4 + movq xm2, [r0 + 2 * r1] + punpcklwd xm4, xm2 + vinserti128 m1, m1, xm4, 1 ; m1 = [18 17 17 16] + pmaddwd m1, [r5 + 1 * mmsize] + paddd m3, m1 + +%ifidn %1,sp + paddd m6, m7 + paddd m5, m7 + paddd m0, m7 + paddd m3, m7 + psrad m6, 12 + psrad m5, 12 + psrad m0, 12 + psrad m3, 12 +%else + psrad m6, 6 + psrad m5, 6 + psrad m0, 6 + psrad m3, 6 +%endif + packssdw m6, m5 + packssdw m0, m3 + lea r2, [r2 + r3 * 4] + +%ifidn %1,sp + packuswb m6, m0 + vextracti128 xm0, m6, 1 + movd [r2], xm6 + movd [r2 + r3], xm0 + pextrd [r2 + r3 * 2], xm6, 1 + pextrd [r2 + r6], xm0, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm6, 2 + pextrd [r2 + r3], xm0, 2 + pextrd [r2 + r3 * 2], xm6, 3 + pextrd [r2 + r6], xm0, 3 +%else + vextracti128 xm5, m6, 1 + vextracti128 xm3, m0, 1 + movq [r2], xm6 + movq [r2 + r3], xm5 + movhps [r2 + r3 * 2], xm6 + movhps [r2 + r6], xm5 + lea r2, [r2 + r3 * 4] + movq [r2], xm0 + movq [r2 + r3], xm3 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm3 +%endif +%endmacro + +%macro FILTER_VER_CHROMA_S_AVX2_4x16 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x16, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + PROCESS_CHROMA_AVX2_W4_16R %1 + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_4x16 sp + FILTER_VER_CHROMA_S_AVX2_4x16 ss + +%macro 
FILTER_VER_CHROMA_S_AVX2_4x32 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x32, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] +%rep 2 + PROCESS_CHROMA_AVX2_W4_16R %1 + lea r2, [r2 + r3 * 4] +%endrep + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_4x32 sp + FILTER_VER_CHROMA_S_AVX2_4x32 ss + +%macro FILTER_VER_CHROMA_S_AVX2_4x2 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x2, 4, 6, 6 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m5, [pd_526336] +%else + add r3d, r3d +%endif + movq xm0, [r0] + movq xm1, [r0 + r1] + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + movq xm3, [r0 + r4] + punpcklwd xm2, xm3 + movq xm4, [r0 + 4 * r1] + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m2, [r5 + 1 * mmsize] + paddd m0, m2 +%ifidn %1,sp + paddd m0, m5 + psrad m0, 12 +%else + psrad m0, 6 +%endif + vextracti128 xm1, m0, 1 + packssdw xm0, xm1 +%ifidn %1,sp + packuswb xm0, xm0 + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 +%else + movq [r2], xm0 + movhps [r2 + r3], xm0 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_4x2 sp + FILTER_VER_CHROMA_S_AVX2_4x2 ss + +%macro FILTER_VER_CHROMA_S_AVX2_2x4 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_2x4, 4, 6, 6 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m5, [pd_526336] +%else + add r3d, r3d +%endif + movd xm0, [r0] + movd xm1, [r0 + r1] + punpcklwd xm0, xm1 + movd xm2, [r0 + r1 * 2] + 
punpcklwd xm1, xm2 + punpcklqdq xm0, xm1 ; m0 = [2 1 1 0] + movd xm3, [r0 + r4] + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movd xm4, [r0] + punpcklwd xm3, xm4 + punpcklqdq xm2, xm3 ; m2 = [4 3 3 2] + vinserti128 m0, m0, xm2, 1 ; m0 = [4 3 3 2 2 1 1 0] + movd xm1, [r0 + r1] + punpcklwd xm4, xm1 + movd xm3, [r0 + r1 * 2] + punpcklwd xm1, xm3 + punpcklqdq xm4, xm1 ; m4 = [6 5 5 4] + vinserti128 m2, m2, xm4, 1 ; m2 = [6 5 5 4 4 3 3 2] + pmaddwd m0, [r5] + pmaddwd m2, [r5 + 1 * mmsize] + paddd m0, m2 +%ifidn %1,sp + paddd m0, m5 + psrad m0, 12 +%else + psrad m0, 6 +%endif + vextracti128 xm1, m0, 1 + packssdw xm0, xm1 + lea r4, [r3 * 3] +%ifidn %1,sp + packuswb xm0, xm0 + pextrw [r2], xm0, 0 + pextrw [r2 + r3], xm0, 1 + pextrw [r2 + 2 * r3], xm0, 2 + pextrw [r2 + r4], xm0, 3 +%else + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + pextrd [r2 + 2 * r3], xm0, 2 + pextrd [r2 + r4], xm0, 3 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_2x4 sp + FILTER_VER_CHROMA_S_AVX2_2x4 ss + +%macro FILTER_VER_CHROMA_S_AVX2_8x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x8, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + 
paddd m1, m5 +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m7 + paddd m3, m7 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 + + movu xm1, [r0 + r4] ; m1 = row 7 + punpckhwd xm3, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm3, 1 + pmaddwd m3, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m3 + + lea r4, [r3 * 3] +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r4], xm2 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + movu [r2], xm0 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2 + r3], xm0 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm3 +%endif + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m7 + paddd m5, m7 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r0 + r1] ; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd m0, [r5 + 1 * mmsize] + paddd m6, m0 + movu xm5, [r0 + r1 * 2] ; m5 = row 10 + punpckhwd xm0, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm0, 1 + pmaddwd m2, [r5 + 1 * mmsize] + paddd m1, m2 + +%ifidn %1,sp + paddd m6, m7 + paddd m1, m7 + psrad 
m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vpermd m4, m3, m4 + vextracti128 xm6, m4, 1 + movq [r2], xm4 + movhps [r2 + r3], xm4 + movq [r2 + r3 * 2], xm6 + movhps [r2 + r4], xm6 +%else + vpermq m4, m4, 11011000b + vpermq m6, m6, 11011000b + vextracti128 xm5, m4, 1 + vextracti128 xm1, m6, 1 + movu [r2], xm4 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm6 + movu [r2 + r4], xm1 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_8x8 sp + FILTER_VER_CHROMA_S_AVX2_8x8 ss + +%macro PROCESS_CHROMA_S_AVX2_W8_16R 1 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r7, [r0 + r1 * 4] + movu xm4, [r7] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] + movu xm5, [r7 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r7 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + paddd m3, m7 + pmaddwd m5, [r5] +%ifidn %1,sp + paddd m0, m9 + paddd m1, m9 + paddd m2, m9 + paddd m3, m9 + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m0, m1 + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 + movq [r2], xm0 + 
movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 +%endif + + movu xm7, [r7 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + paddd m4, m8 + pmaddwd m6, [r5] + lea r7, [r7 + r1 * 4] + movu xm8, [r7] ; m8 = row 8 + punpckhwd xm0, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm0, 1 + pmaddwd m0, m7, [r5 + 1 * mmsize] + paddd m5, m0 + pmaddwd m7, [r5] + movu xm0, [r7 + r1] ; m0 = row 9 + punpckhwd xm1, xm8, xm0 + punpcklwd xm8, xm0 + vinserti128 m8, m8, xm1, 1 + pmaddwd m1, m8, [r5 + 1 * mmsize] + paddd m6, m1 + pmaddwd m8, [r5] + movu xm1, [r7 + r1 * 2] ; m1 = row 10 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m2, m0, [r5 + 1 * mmsize] + paddd m7, m2 + pmaddwd m0, [r5] +%ifidn %1,sp + paddd m4, m9 + paddd m5, m9 + psrad m4, 12 + psrad m5, 12 + paddd m6, m9 + paddd m7, m9 + psrad m6, 12 + psrad m7, 12 +%else + psrad m4, 6 + psrad m5, 6 + psrad m6, 6 + psrad m7, 6 +%endif + packssdw m4, m5 + packssdw m6, m7 + lea r8, [r2 + r3 * 4] +%ifidn %1,sp + packuswb m4, m6 + vpermd m4, m3, m4 + vextracti128 xm6, m4, 1 + movq [r8], xm4 + movhps [r8 + r3], xm4 + movq [r8 + r3 * 2], xm6 + movhps [r8 + r6], xm6 +%else + vpermq m4, m4, 11011000b + vpermq m6, m6, 11011000b + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + movu [r8], xm4 + movu [r8 + r3], xm5 + movu [r8 + r3 * 2], xm6 + movu [r8 + r6], xm7 +%endif + + movu xm2, [r7 + r4] ; m2 = row 11 + punpckhwd xm4, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm4, 1 + pmaddwd m4, m1, [r5 + 1 * mmsize] + paddd m8, m4 + pmaddwd m1, [r5] + lea r7, [r7 + r1 * 4] + movu xm4, [r7] ; m4 = row 12 + punpckhwd xm5, xm2, xm4 + punpcklwd xm2, xm4 + vinserti128 m2, m2, xm5, 1 + pmaddwd 
m5, m2, [r5 + 1 * mmsize] + paddd m0, m5 + pmaddwd m2, [r5] + movu xm5, [r7 + r1] ; m5 = row 13 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m1, m6 + pmaddwd m4, [r5] + movu xm6, [r7 + r1 * 2] ; m6 = row 14 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + paddd m2, m7 + pmaddwd m5, [r5] +%ifidn %1,sp + paddd m8, m9 + paddd m0, m9 + paddd m1, m9 + paddd m2, m9 + psrad m8, 12 + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 +%else + psrad m8, 6 + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 +%endif + packssdw m8, m0 + packssdw m1, m2 + lea r8, [r8 + r3 * 4] +%ifidn %1,sp + packuswb m8, m1 + vpermd m8, m3, m8 + vextracti128 xm1, m8, 1 + movq [r8], xm8 + movhps [r8 + r3], xm8 + movq [r8 + r3 * 2], xm1 + movhps [r8 + r6], xm1 +%else + vpermq m8, m8, 11011000b + vpermq m1, m1, 11011000b + vextracti128 xm0, m8, 1 + vextracti128 xm2, m1, 1 + movu [r8], xm8 + movu [r8 + r3], xm0 + movu [r8 + r3 * 2], xm1 + movu [r8 + r6], xm2 +%endif + lea r8, [r8 + r3 * 4] + + movu xm7, [r7 + r4] ; m7 = row 15 + punpckhwd xm2, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm2, 1 + pmaddwd m2, m6, [r5 + 1 * mmsize] + paddd m4, m2 + pmaddwd m6, [r5] + lea r7, [r7 + r1 * 4] + movu xm2, [r7] ; m2 = row 16 + punpckhwd xm1, xm7, xm2 + punpcklwd xm7, xm2 + vinserti128 m7, m7, xm1, 1 + pmaddwd m1, m7, [r5 + 1 * mmsize] + paddd m5, m1 + pmaddwd m7, [r5] + movu xm1, [r7 + r1] ; m1 = row 17 + punpckhwd xm0, xm2, xm1 + punpcklwd xm2, xm1 + vinserti128 m2, m2, xm0, 1 + pmaddwd m2, [r5 + 1 * mmsize] + paddd m6, m2 + movu xm0, [r7 + r1 * 2] ; m0 = row 18 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m1, [r5 + 1 * mmsize] + paddd m7, m1 + +%ifidn %1,sp + paddd m4, m9 + paddd m5, m9 + paddd m6, m9 + paddd m7, m9 + psrad m4, 12 + psrad m5, 12 + psrad m6, 12 + psrad m7, 12 +%else + psrad m4, 6 + psrad m5, 6 + psrad m6, 6 + psrad m7, 6 
+%endif + packssdw m4, m5 + packssdw m6, m7 +%ifidn %1,sp + packuswb m4, m6 + vpermd m4, m3, m4 + vextracti128 xm6, m4, 1 + movq [r8], xm4 + movhps [r8 + r3], xm4 + movq [r8 + r3 * 2], xm6 + movhps [r8 + r6], xm6 +%else + vpermq m4, m4, 11011000b + vpermq m6, m6, 11011000b + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + movu [r8], xm4 + movu [r8 + r3], xm5 + movu [r8 + r3 * 2], xm6 + movu [r8 + r6], xm7 +%endif +%endmacro + +%macro FILTER_VER_CHROMA_S_AVX2_Nx16 2 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_%2x16, 4, 10, 10 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m9, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + mov r9d, %2 / 8 +.loopW: + PROCESS_CHROMA_S_AVX2_W8_16R %1 +%ifidn %1,sp + add r2, 8 +%else + add r2, 16 +%endif + add r0, 16 + dec r9d + jnz .loopW + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16 + FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32 + FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 64 + FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16 + FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32 + FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 64 + +%macro FILTER_VER_CHROMA_S_AVX2_NxN 3 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%3_%1x%2, 4, 11, 10 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %3,sp + mova m9, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + mov r9d, %2 / 16 +.loopH: + mov r10d, %1 / 8 +.loopW: + PROCESS_CHROMA_S_AVX2_W8_16R %3 +%ifidn %3,sp + add r2, 8 +%else + add r2, 16 +%endif + add r0, 16 + dec r10d + jnz .loopW + lea r0, [r7 - 2 * %1 + 16] +%ifidn %3,sp + lea r2, [r8 + r3 * 4 - %1 + 8] +%else + lea r2, [r8 + r3 * 4 - 2 * %1 + 16] +%endif + dec r9d + jnz .loopH + RET +%endif +%endmacro + + 
FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp + FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp + FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp + FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, sp + FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, ss + FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, ss + FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, ss + FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, ss + FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, sp + FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, sp + FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, ss + FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, ss + FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, ss + +%macro PROCESS_CHROMA_S_AVX2_W8_4R 1 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m2, m4 + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm4, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm4, 1 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 +%ifidn %1,sp + paddd m0, m7 + paddd m1, 
m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m0, m1 + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 +%endif +%endmacro + +%macro FILTER_VER_CHROMA_S_AVX2_8x4 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x4, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif + + PROCESS_CHROMA_S_AVX2_W8_4R %1 + lea r4, [r3 * 3] +%ifidn %1,sp + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r4], xm2 +%else + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm3 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_8x4 sp + FILTER_VER_CHROMA_S_AVX2_8x4 ss + +%macro FILTER_VER_CHROMA_S_AVX2_12x16 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_12x16, 4, 9, 10 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m9, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + PROCESS_CHROMA_S_AVX2_W8_16R %1 +%ifidn %1,sp + add r2, 8 +%else + add r2, 16 +%endif + add r0, 16 + mova m7, m9 + PROCESS_CHROMA_AVX2_W4_16R %1 + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_12x16 sp + FILTER_VER_CHROMA_S_AVX2_12x16 ss + +%macro FILTER_VER_CHROMA_S_AVX2_12x32 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_12x32, 4, 9, 10 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else 
+ lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1, sp + mova m9, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] +%rep 2 + PROCESS_CHROMA_S_AVX2_W8_16R %1 +%ifidn %1, sp + add r2, 8 +%else + add r2, 16 +%endif + add r0, 16 + mova m7, m9 + PROCESS_CHROMA_AVX2_W4_16R %1 + sub r0, 16 +%ifidn %1, sp + lea r2, [r2 + r3 * 4 - 8] +%else + lea r2, [r2 + r3 * 4 - 16] +%endif +%endrep + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_12x32 sp + FILTER_VER_CHROMA_S_AVX2_12x32 ss + +%macro FILTER_VER_CHROMA_S_AVX2_16x12 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_16x12, 4, 9, 9 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m8, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] +%rep 2 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r7, [r0 + r1 * 4] + movu xm4, [r7] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m1, m8 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r7 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r7 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, 
xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m8 + paddd m3, m8 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + movu [r2], xm0 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2 + r3], xm0 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 +%endif + lea r8, [r2 + r3 * 4] + + movu xm1, [r7 + r4] ; m1 = row 7 + punpckhwd xm0, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm0, 1 + pmaddwd m0, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m0 + lea r7, [r7 + r1 * 4] + movu xm0, [r7] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m8 + paddd m5, m8 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r7 + r1] ; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd m5, m0, [r5 + 1 * mmsize] + paddd m6, m5 + pmaddwd m0, [r5] + movu xm5, [r7 + r1 * 2] ; m5 = row 10 + punpckhwd xm7, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm7, 1 + pmaddwd m7, m2, [r5 + 1 * mmsize] + paddd m1, m7 + pmaddwd m2, [r5] + +%ifidn %1,sp + paddd m6, m8 + paddd m1, m8 + psrad m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vpermd m4, m3, m4 + vextracti128 xm6, m4, 1 + movq [r8], xm4 + movhps [r8 + r3], xm4 + movq [r8 + r3 * 2], xm6 + movhps [r8 + r6], xm6 +%else + vpermq m4, m4, 11011000b + vpermq m6, m6, 11011000b + vextracti128 xm7, m4, 1 + vextracti128 xm1, m6, 1 + movu [r8], xm4 + movu 
[r8 + r3], xm7 + movu [r8 + r3 * 2], xm6 + movu [r8 + r6], xm1 +%endif + lea r8, [r8 + r3 * 4] + + movu xm7, [r7 + r4] ; m7 = row 11 + punpckhwd xm1, xm5, xm7 + punpcklwd xm5, xm7 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + paddd m0, m1 + pmaddwd m5, [r5] + lea r7, [r7 + r1 * 4] + movu xm1, [r7] ; m1 = row 12 + punpckhwd xm4, xm7, xm1 + punpcklwd xm7, xm1 + vinserti128 m7, m7, xm4, 1 + pmaddwd m4, m7, [r5 + 1 * mmsize] + paddd m2, m4 + pmaddwd m7, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m2, m8 + psrad m0, 12 + psrad m2, 12 +%else + psrad m0, 6 + psrad m2, 6 +%endif + packssdw m0, m2 + + movu xm4, [r7 + r1] ; m4 = row 13 + punpckhwd xm2, xm1, xm4 + punpcklwd xm1, xm4 + vinserti128 m1, m1, xm2, 1 + pmaddwd m1, [r5 + 1 * mmsize] + paddd m5, m1 + movu xm2, [r7 + r1 * 2] ; m2 = row 14 + punpckhwd xm6, xm4, xm2 + punpcklwd xm4, xm2 + vinserti128 m4, m4, xm6, 1 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m7, m4 +%ifidn %1,sp + paddd m5, m8 + paddd m7, m8 + psrad m5, 12 + psrad m7, 12 +%else + psrad m5, 6 + psrad m7, 6 +%endif + packssdw m5, m7 +%ifidn %1,sp + packuswb m0, m5 + vpermd m0, m3, m0 + vextracti128 xm5, m0, 1 + movq [r8], xm0 + movhps [r8 + r3], xm0 + movq [r8 + r3 * 2], xm5 + movhps [r8 + r6], xm5 + add r2, 8 +%else + vpermq m0, m0, 11011000b + vpermq m5, m5, 11011000b + vextracti128 xm7, m0, 1 + vextracti128 xm6, m5, 1 + movu [r8], xm0 + movu [r8 + r3], xm7 + movu [r8 + r3 * 2], xm5 + movu [r8 + r6], xm6 + add r2, 16 +%endif + add r0, 16 +%endrep + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_16x12 sp + FILTER_VER_CHROMA_S_AVX2_16x12 ss + +%macro FILTER_VER_CHROMA_S_AVX2_8x12 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x12, 4, 7, 9 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m8, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + movu xm0, 
[r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m1, m8 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m8 + paddd m3, m8 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + movu [r2], xm0 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2 + r3], xm0 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 +%endif + lea r2, [r2 + r3 * 4] + + movu xm1, [r0 + r4] ; m1 = row 7 + punpckhwd xm0, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm0, 1 + pmaddwd m0, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m0 + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + 
punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m8 + paddd m5, m8 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r0 + r1] ; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd m5, m0, [r5 + 1 * mmsize] + paddd m6, m5 + pmaddwd m0, [r5] + movu xm5, [r0 + r1 * 2] ; m5 = row 10 + punpckhwd xm7, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm7, 1 + pmaddwd m7, m2, [r5 + 1 * mmsize] + paddd m1, m7 + pmaddwd m2, [r5] + +%ifidn %1,sp + paddd m6, m8 + paddd m1, m8 + psrad m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vpermd m4, m3, m4 + vextracti128 xm6, m4, 1 + movq [r2], xm4 + movhps [r2 + r3], xm4 + movq [r2 + r3 * 2], xm6 + movhps [r2 + r6], xm6 +%else + vpermq m4, m4, 11011000b + vpermq m6, m6, 11011000b + vextracti128 xm7, m4, 1 + vextracti128 xm1, m6, 1 + movu [r2], xm4 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm1 +%endif + lea r2, [r2 + r3 * 4] + + movu xm7, [r0 + r4] ; m7 = row 11 + punpckhwd xm1, xm5, xm7 + punpcklwd xm5, xm7 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + paddd m0, m1 + pmaddwd m5, [r5] + lea r0, [r0 + r1 * 4] + movu xm1, [r0] ; m1 = row 12 + punpckhwd xm4, xm7, xm1 + punpcklwd xm7, xm1 + vinserti128 m7, m7, xm4, 1 + pmaddwd m4, m7, [r5 + 1 * mmsize] + paddd m2, m4 + pmaddwd m7, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m2, m8 + psrad m0, 12 + psrad m2, 12 +%else + psrad m0, 6 + psrad m2, 6 +%endif + packssdw m0, m2 + + movu xm4, [r0 + r1] ; m4 = row 13 + punpckhwd xm2, xm1, xm4 + punpcklwd xm1, xm4 + vinserti128 m1, m1, xm2, 1 + pmaddwd m1, [r5 + 1 * mmsize] + paddd m5, m1 + movu xm2, [r0 + r1 * 2] ; m2 = row 14 + punpckhwd xm6, xm4, xm2 + punpcklwd xm4, xm2 + vinserti128 m4, m4, xm6, 1 + pmaddwd m4, [r5 + 1 * mmsize] + paddd 
m7, m4 +%ifidn %1,sp + paddd m5, m8 + paddd m7, m8 + psrad m5, 12 + psrad m7, 12 +%else + psrad m5, 6 + psrad m7, 6 +%endif + packssdw m5, m7 +%ifidn %1,sp + packuswb m0, m5 + vpermd m0, m3, m0 + vextracti128 xm5, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm5 +%else + vpermq m0, m0, 11011000b + vpermq m5, m5, 11011000b + vextracti128 xm7, m0, 1 + vextracti128 xm6, m5, 1 + movu [r2], xm0 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm5 + movu [r2 + r6], xm6 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_8x12 sp + FILTER_VER_CHROMA_S_AVX2_8x12 ss + +%macro FILTER_VER_CHROMA_S_AVX2_16x4 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_16x4, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif +%rep 2 + PROCESS_CHROMA_S_AVX2_W8_4R %1 + lea r6, [r3 * 3] +%ifidn %1,sp + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 + add r2, 8 +%else + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 + add r2, 16 +%endif + lea r6, [4 * r1 - 16] + sub r0, r6 +%endrep + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_16x4 sp + FILTER_VER_CHROMA_S_AVX2_16x4 ss + +%macro PROCESS_CHROMA_S_AVX2_W8_8R 1 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r7, [r0 + r1 * 4] + movu xm4, [r7] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd 
xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r7 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r7 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m7 + paddd m3, m7 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + movu [r2], xm0 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2 + r3], xm0 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 +%endif + lea r8, [r2 + r3 * 4] + + movu xm1, [r7 + r4] ; m1 = row 7 + punpckhwd xm0, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm0, 1 + pmaddwd m0, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m0 + lea r7, [r7 + r1 * 4] + movu xm0, [r7] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m7 + paddd m5, m7 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r7 + r1] ; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd m0, [r5 + 1 * mmsize] + paddd m6, m0 + movu xm5, [r7 + r1 * 2] ; m5 = row 10 + punpckhwd xm0, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm0, 1 + pmaddwd m2, [r5 + 1 * 
mmsize] + paddd m1, m2 + +%ifidn %1,sp + paddd m6, m7 + paddd m1, m7 + psrad m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vpermd m4, m3, m4 + vextracti128 xm6, m4, 1 + movq [r8], xm4 + movhps [r8 + r3], xm4 + movq [r8 + r3 * 2], xm6 + movhps [r8 + r6], xm6 +%else + vpermq m4, m4, 11011000b + vpermq m6, m6, 11011000b + vextracti128 xm7, m4, 1 + vextracti128 xm1, m6, 1 + movu [r8], xm4 + movu [r8 + r3], xm7 + movu [r8 + r3 * 2], xm6 + movu [r8 + r6], xm1 +%endif +%endmacro + +%macro FILTER_VER_CHROMA_S_AVX2_Nx8 2 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_%2x8, 4, 9, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] +%rep %2 / 8 + PROCESS_CHROMA_S_AVX2_W8_8R %1 +%ifidn %1,sp + add r2, 8 +%else + add r2, 16 +%endif + add r0, 16 +%endrep + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32 + FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16 + FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32 + FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16 + +%macro FILTER_VER_CHROMA_S_AVX2_8x2 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x2, 4, 6, 6 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m5, [pd_526336] +%else + add r3d, r3d +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd 
m2, [r5 + 1 * mmsize] + paddd m0, m2 + movu xm4, [r0 + r1 * 4] ; m4 = row 4 + punpckhwd xm2, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm2, 1 + pmaddwd m3, [r5 + 1 * mmsize] + paddd m1, m3 +%ifidn %1,sp + paddd m0, m5 + paddd m1, m5 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 +%ifidn %1,sp + vextracti128 xm1, m0, 1 + packuswb xm0, xm1 + pshufd xm0, xm0, 11011000b + movq [r2], xm0 + movhps [r2 + r3], xm0 +%else + vpermq m0, m0, 11011000b + vextracti128 xm1, m0, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_8x2 sp + FILTER_VER_CHROMA_S_AVX2_8x2 ss + +%macro FILTER_VER_CHROMA_S_AVX2_8x6 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x6, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + paddd m1, m5 +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + 
movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m7 + paddd m3, m7 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 + + movu xm1, [r0 + r4] ; m1 = row 7 + punpckhwd xm3, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm3, 1 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m4, m6 + movu xm6, [r0 + r1 * 4] ; m6 = row 8 + punpckhwd xm3, xm1, xm6 + punpcklwd xm1, xm6 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5 + 1 * mmsize] + paddd m5, m1 +%ifidn %1,sp + paddd m4, m7 + paddd m5, m7 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + lea r4, [r3 * 3] +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 + vextracti128 xm5, m4, 1 + packuswb xm4, xm5 + pshufd xm4, xm4, 11011000b + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r4], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm4 + movhps [r2 + r3], xm4 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + vpermq m4, m4, 11011000b + movu [r2], xm0 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + vextracti128 xm5, m4, 1 + movu [r2 + r3], xm0 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm3 + lea r2, [r2 + r3 * 4] + movu [r2], xm4 + movu [r2 + r3], xm5 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_8x6 sp + FILTER_VER_CHROMA_S_AVX2_8x6 ss + +%macro FILTER_VER_CHROMA_S_AVX2_8xN 2 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x%2, 4, 7, 9 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m8, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] +%rep %2 / 16 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; 
m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m1, m8 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m8 + paddd m3, m8 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + movu [r2], xm0 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2 + r3], xm0 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 +%endif + lea r2, [r2 + r3 * 4] + + movu xm1, [r0 + r4] ; m1 = row 7 + punpckhwd xm0, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm0, 1 + pmaddwd m0, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m0 + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, 
xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m8 + paddd m5, m8 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r0 + r1] ; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd m5, m0, [r5 + 1 * mmsize] + paddd m6, m5 + pmaddwd m0, [r5] + movu xm5, [r0 + r1 * 2] ; m5 = row 10 + punpckhwd xm7, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm7, 1 + pmaddwd m7, m2, [r5 + 1 * mmsize] + paddd m1, m7 + pmaddwd m2, [r5] + +%ifidn %1,sp + paddd m6, m8 + paddd m1, m8 + psrad m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vpermd m4, m3, m4 + vextracti128 xm6, m4, 1 + movq [r2], xm4 + movhps [r2 + r3], xm4 + movq [r2 + r3 * 2], xm6 + movhps [r2 + r6], xm6 +%else + vpermq m4, m4, 11011000b + vpermq m6, m6, 11011000b + vextracti128 xm7, m4, 1 + vextracti128 xm1, m6, 1 + movu [r2], xm4 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm1 +%endif + lea r2, [r2 + r3 * 4] + + movu xm7, [r0 + r4] ; m7 = row 11 + punpckhwd xm1, xm5, xm7 + punpcklwd xm5, xm7 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + paddd m0, m1 + pmaddwd m5, [r5] + lea r0, [r0 + r1 * 4] + movu xm1, [r0] ; m1 = row 12 + punpckhwd xm4, xm7, xm1 + punpcklwd xm7, xm1 + vinserti128 m7, m7, xm4, 1 + pmaddwd m4, m7, [r5 + 1 * mmsize] + paddd m2, m4 + pmaddwd m7, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m2, m8 + psrad m0, 12 + psrad m2, 12 +%else + psrad m0, 6 + psrad m2, 6 +%endif + packssdw m0, m2 + + movu xm4, [r0 + r1] ; m4 = row 13 + punpckhwd xm2, xm1, xm4 + punpcklwd xm1, xm4 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + paddd m5, m2 + pmaddwd m1, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 14 + punpckhwd xm6, xm4, xm2 + punpcklwd xm4, xm2 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m7, m6 + pmaddwd 
m4, [r5] +%ifidn %1,sp + paddd m5, m8 + paddd m7, m8 + psrad m5, 12 + psrad m7, 12 +%else + psrad m5, 6 + psrad m7, 6 +%endif + packssdw m5, m7 +%ifidn %1,sp + packuswb m0, m5 + vpermd m0, m3, m0 + vextracti128 xm5, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm5 +%else + vpermq m0, m0, 11011000b + vpermq m5, m5, 11011000b + vextracti128 xm7, m0, 1 + vextracti128 xm6, m5, 1 + movu [r2], xm0 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm5 + movu [r2 + r6], xm6 +%endif + lea r2, [r2 + r3 * 4] + + movu xm6, [r0 + r4] ; m6 = row 15 + punpckhwd xm5, xm2, xm6 + punpcklwd xm2, xm6 + vinserti128 m2, m2, xm5, 1 + pmaddwd m5, m2, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 16 + punpckhwd xm5, xm6, xm0 + punpcklwd xm6, xm0 + vinserti128 m6, m6, xm5, 1 + pmaddwd m5, m6, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m6, [r5] +%ifidn %1,sp + paddd m1, m8 + paddd m4, m8 + psrad m1, 12 + psrad m4, 12 +%else + psrad m1, 6 + psrad m4, 6 +%endif + packssdw m1, m4 + + movu xm5, [r0 + r1] ; m5 = row 17 + punpckhwd xm4, xm0, xm5 + punpcklwd xm0, xm5 + vinserti128 m0, m0, xm4, 1 + pmaddwd m0, [r5 + 1 * mmsize] + paddd m2, m0 + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhwd xm0, xm5, xm4 + punpcklwd xm5, xm4 + vinserti128 m5, m5, xm0, 1 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m6, m5 +%ifidn %1,sp + paddd m2, m8 + paddd m6, m8 + psrad m2, 12 + psrad m6, 12 +%else + psrad m2, 6 + psrad m6, 6 +%endif + packssdw m2, m6 +%ifidn %1,sp + packuswb m1, m2 + vpermd m1, m3, m1 + vextracti128 xm2, m1, 1 + movq [r2], xm1 + movhps [r2 + r3], xm1 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 +%else + vpermq m1, m1, 11011000b + vpermq m2, m2, 11011000b + vextracti128 xm6, m1, 1 + vextracti128 xm4, m2, 1 + movu [r2], xm1 + movu [r2 + r3], xm6 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm4 +%endif + lea r2, [r2 + r3 * 4] +%endrep + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_8xN sp, 16 
+ FILTER_VER_CHROMA_S_AVX2_8xN sp, 32 + FILTER_VER_CHROMA_S_AVX2_8xN sp, 64 + FILTER_VER_CHROMA_S_AVX2_8xN ss, 16 + FILTER_VER_CHROMA_S_AVX2_8xN ss, 32 + FILTER_VER_CHROMA_S_AVX2_8xN ss, 64 + +%macro FILTER_VER_CHROMA_S_AVX2_Nx24 2 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_%2x24, 4, 10, 10 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m9, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + mov r9d, %2 / 8 +.loopW: + PROCESS_CHROMA_S_AVX2_W8_16R %1 +%ifidn %1,sp + add r2, 8 +%else + add r2, 16 +%endif + add r0, 16 + dec r9d + jnz .loopW +%ifidn %1,sp + lea r2, [r8 + r3 * 4 - %2 + 8] +%else + lea r2, [r8 + r3 * 4 - 2 * %2 + 16] +%endif + lea r0, [r7 - 2 * %2 + 16] + mova m7, m9 + mov r9d, %2 / 8 +.loop: + PROCESS_CHROMA_S_AVX2_W8_8R %1 +%ifidn %1,sp + add r2, 8 +%else + add r2, 16 +%endif + add r0, 16 + dec r9d + jnz .loop + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 32 + FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 16 + FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 32 + FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 16 + +%macro FILTER_VER_CHROMA_S_AVX2_2x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_2x8, 4, 6, 7 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m6, [pd_526336] +%else + add r3d, r3d +%endif + movd xm0, [r0] + movd xm1, [r0 + r1] + punpcklwd xm0, xm1 + movd xm2, [r0 + r1 * 2] + punpcklwd xm1, xm2 + punpcklqdq xm0, xm1 ; m0 = [2 1 1 0] + movd xm3, [r0 + r4] + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movd xm4, [r0] + punpcklwd xm3, xm4 + punpcklqdq xm2, xm3 ; m2 = [4 3 3 2] + vinserti128 m0, m0, xm2, 1 ; m0 = [4 3 3 2 2 1 1 0] + movd xm1, [r0 + r1] + punpcklwd xm4, xm1 + movd xm3, [r0 + r1 * 2] + punpcklwd xm1, xm3 + punpcklqdq 
xm4, xm1 ; m4 = [6 5 5 4] + vinserti128 m2, m2, xm4, 1 ; m2 = [6 5 5 4 4 3 3 2] + pmaddwd m0, [r5] + pmaddwd m2, [r5 + 1 * mmsize] + paddd m0, m2 + movd xm1, [r0 + r4] + punpcklwd xm3, xm1 + lea r0, [r0 + 4 * r1] + movd xm2, [r0] + punpcklwd xm1, xm2 + punpcklqdq xm3, xm1 ; m3 = [8 7 7 6] + vinserti128 m4, m4, xm3, 1 ; m4 = [8 7 7 6 6 5 5 4] + movd xm1, [r0 + r1] + punpcklwd xm2, xm1 + movd xm5, [r0 + r1 * 2] + punpcklwd xm1, xm5 + punpcklqdq xm2, xm1 ; m2 = [10 9 9 8] + vinserti128 m3, m3, xm2, 1 ; m3 = [10 9 9 8 8 7 7 6] + pmaddwd m4, [r5] + pmaddwd m3, [r5 + 1 * mmsize] + paddd m4, m3 +%ifidn %1,sp + paddd m0, m6 + paddd m4, m6 + psrad m0, 12 + psrad m4, 12 +%else + psrad m0, 6 + psrad m4, 6 +%endif + packssdw m0, m4 + vextracti128 xm4, m0, 1 + lea r4, [r3 * 3] +%ifidn %1,sp + packuswb xm0, xm4 + pextrw [r2], xm0, 0 + pextrw [r2 + r3], xm0, 1 + pextrw [r2 + 2 * r3], xm0, 4 + pextrw [r2 + r4], xm0, 5 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 2 + pextrw [r2 + r3], xm0, 3 + pextrw [r2 + 2 * r3], xm0, 6 + pextrw [r2 + r4], xm0, 7 +%else + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + movd [r2 + 2 * r3], xm4 + pextrd [r2 + r4], xm4, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm0, 2 + pextrd [r2 + r3], xm0, 3 + pextrd [r2 + 2 * r3], xm4, 2 + pextrd [r2 + r4], xm4, 3 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_2x8 sp + FILTER_VER_CHROMA_S_AVX2_2x8 ss + +%macro FILTER_VER_CHROMA_S_AVX2_2x16 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_2x16, 4, 6, 9 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m6, [pd_526336] +%else + add r3d, r3d +%endif + movd xm0, [r0] + movd xm1, [r0 + r1] + punpcklwd xm0, xm1 + movd xm2, [r0 + r1 * 2] + punpcklwd xm1, xm2 + punpcklqdq xm0, xm1 ; m0 = [2 1 1 0] + movd xm3, [r0 + r4] + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movd xm4, [r0] + punpcklwd xm3, xm4 + 
punpcklqdq xm2, xm3 ; m2 = [4 3 3 2] + vinserti128 m0, m0, xm2, 1 ; m0 = [4 3 3 2 2 1 1 0] + movd xm1, [r0 + r1] + punpcklwd xm4, xm1 + movd xm3, [r0 + r1 * 2] + punpcklwd xm1, xm3 + punpcklqdq xm4, xm1 ; m4 = [6 5 5 4] + vinserti128 m2, m2, xm4, 1 ; m2 = [6 5 5 4 4 3 3 2] + pmaddwd m0, [r5] + pmaddwd m2, [r5 + 1 * mmsize] + paddd m0, m2 + movd xm1, [r0 + r4] + punpcklwd xm3, xm1 + lea r0, [r0 + 4 * r1] + movd xm2, [r0] + punpcklwd xm1, xm2 + punpcklqdq xm3, xm1 ; m3 = [8 7 7 6] + vinserti128 m4, m4, xm3, 1 ; m4 = [8 7 7 6 6 5 5 4] + movd xm1, [r0 + r1] + punpcklwd xm2, xm1 + movd xm5, [r0 + r1 * 2] + punpcklwd xm1, xm5 + punpcklqdq xm2, xm1 ; m2 = [10 9 9 8] + vinserti128 m3, m3, xm2, 1 ; m3 = [10 9 9 8 8 7 7 6] + pmaddwd m4, [r5] + pmaddwd m3, [r5 + 1 * mmsize] + paddd m4, m3 + movd xm1, [r0 + r4] + punpcklwd xm5, xm1 + lea r0, [r0 + 4 * r1] + movd xm3, [r0] + punpcklwd xm1, xm3 + punpcklqdq xm5, xm1 ; m5 = [12 11 11 10] + vinserti128 m2, m2, xm5, 1 ; m2 = [12 11 11 10 10 9 9 8] + movd xm1, [r0 + r1] + punpcklwd xm3, xm1 + movd xm7, [r0 + r1 * 2] + punpcklwd xm1, xm7 + punpcklqdq xm3, xm1 ; m3 = [14 13 13 12] + vinserti128 m5, m5, xm3, 1 ; m5 = [14 13 13 12 12 11 11 10] + pmaddwd m2, [r5] + pmaddwd m5, [r5 + 1 * mmsize] + paddd m2, m5 + movd xm5, [r0 + r4] + punpcklwd xm7, xm5 + lea r0, [r0 + 4 * r1] + movd xm1, [r0] + punpcklwd xm5, xm1 + punpcklqdq xm7, xm5 ; m7 = [16 15 15 14] + vinserti128 m3, m3, xm7, 1 ; m3 = [16 15 15 14 14 13 13 12] + movd xm5, [r0 + r1] + punpcklwd xm1, xm5 + movd xm8, [r0 + r1 * 2] + punpcklwd xm5, xm8 + punpcklqdq xm1, xm5 ; m1 = [18 17 17 16] + vinserti128 m7, m7, xm1, 1 ; m7 = [18 17 17 16 16 15 15 14] + pmaddwd m3, [r5] + pmaddwd m7, [r5 + 1 * mmsize] + paddd m3, m7 +%ifidn %1,sp + paddd m0, m6 + paddd m4, m6 + paddd m2, m6 + paddd m3, m6 + psrad m0, 12 + psrad m4, 12 + psrad m2, 12 + psrad m3, 12 +%else + psrad m0, 6 + psrad m4, 6 + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m0, m4 + packssdw m2, m3 + lea r4, [r3 * 3] +%ifidn 
%1,sp + packuswb m0, m2 + vextracti128 xm2, m0, 1 + pextrw [r2], xm0, 0 + pextrw [r2 + r3], xm0, 1 + pextrw [r2 + 2 * r3], xm2, 0 + pextrw [r2 + r4], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 2 + pextrw [r2 + r3], xm0, 3 + pextrw [r2 + 2 * r3], xm2, 2 + pextrw [r2 + r4], xm2, 3 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 4 + pextrw [r2 + r3], xm0, 5 + pextrw [r2 + 2 * r3], xm2, 4 + pextrw [r2 + r4], xm2, 5 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 6 + pextrw [r2 + r3], xm0, 7 + pextrw [r2 + 2 * r3], xm2, 6 + pextrw [r2 + r4], xm2, 7 +%else + vextracti128 xm4, m0, 1 + vextracti128 xm3, m2, 1 + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + movd [r2 + 2 * r3], xm4 + pextrd [r2 + r4], xm4, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm0, 2 + pextrd [r2 + r3], xm0, 3 + pextrd [r2 + 2 * r3], xm4, 2 + pextrd [r2 + r4], xm4, 3 + lea r2, [r2 + r3 * 4] + movd [r2], xm2 + pextrd [r2 + r3], xm2, 1 + movd [r2 + 2 * r3], xm3 + pextrd [r2 + r4], xm3, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm2, 2 + pextrd [r2 + r3], xm2, 3 + pextrd [r2 + 2 * r3], xm3, 2 + pextrd [r2 + r4], xm3, 3 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_2x16 sp + FILTER_VER_CHROMA_S_AVX2_2x16 ss + +%macro FILTER_VER_CHROMA_S_AVX2_6x8 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_6x8, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, 
[r5] + paddd m0, m4 + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + paddd m1, m5 +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m7 + paddd m3, m7 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 + + movu xm1, [r0 + r4] ; m1 = row 7 + punpckhwd xm3, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm3, 1 + pmaddwd m3, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m3 + + lea r4, [r3 * 3] +%ifidn %1,sp + packuswb m0, m2 + vextracti128 xm2, m0, 1 + movd [r2], xm0 + pextrw [r2 + 4], xm2, 0 + pextrd [r2 + r3], xm0, 1 + pextrw [r2 + r3 + 4], xm2, 2 + pextrd [r2 + r3 * 2], xm0, 2 + pextrw [r2 + r3 * 2 + 4], xm2, 4 + pextrd [r2 + r4], xm0, 3 + pextrw [r2 + r4 + 4], xm2, 6 +%else + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r4], xm2 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movd [r2 + 8], xm0 + pextrd [r2 + r3 + 8], xm0, 2 + movd [r2 + r3 * 2 + 8], xm3 + pextrd [r2 + r4 + 8], xm3, 2 +%endif + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m7 + paddd m5, m7 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r0 + r1] 
; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd m0, [r5 + 1 * mmsize] + paddd m6, m0 + movu xm5, [r0 + r1 * 2] ; m5 = row 10 + punpckhwd xm0, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm0, 1 + pmaddwd m2, [r5 + 1 * mmsize] + paddd m1, m2 + +%ifidn %1,sp + paddd m6, m7 + paddd m1, m7 + psrad m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vextracti128 xm6, m4, 1 + movd [r2], xm4 + pextrw [r2 + 4], xm6, 0 + pextrd [r2 + r3], xm4, 1 + pextrw [r2 + r3 + 4], xm6, 2 + pextrd [r2 + r3 * 2], xm4, 2 + pextrw [r2 + r3 * 2 + 4], xm6, 4 + pextrd [r2 + r4], xm4, 3 + pextrw [r2 + r4 + 4], xm6, 6 +%else + movq [r2], xm4 + movhps [r2 + r3], xm4 + movq [r2 + r3 * 2], xm6 + movhps [r2 + r4], xm6 + vextracti128 xm5, m4, 1 + vextracti128 xm1, m6, 1 + movd [r2 + 8], xm5 + pextrd [r2 + r3 + 8], xm5, 2 + movd [r2 + r3 * 2 + 8], xm1 + pextrd [r2 + r4 + 8], xm1, 2 +%endif + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_6x8 sp + FILTER_VER_CHROMA_S_AVX2_6x8 ss + +%macro FILTER_VER_CHROMA_S_AVX2_6x16 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_6x16, 4, 7, 9 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m8, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + 
punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m1, m8 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m8 + paddd m3, m8 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + vextracti128 xm2, m0, 1 + movd [r2], xm0 + pextrw [r2 + 4], xm2, 0 + pextrd [r2 + r3], xm0, 1 + pextrw [r2 + r3 + 4], xm2, 2 + pextrd [r2 + r3 * 2], xm0, 2 + pextrw [r2 + r3 * 2 + 4], xm2, 4 + pextrd [r2 + r6], xm0, 3 + pextrw [r2 + r6 + 4], xm2, 6 +%else + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movd [r2 + 8], xm0 + pextrd [r2 + r3 + 8], xm0, 2 + movd [r2 + r3 * 2 + 8], xm3 + pextrd [r2 + r6 + 8], xm3, 2 +%endif + lea r2, [r2 + r3 * 4] + movu xm1, [r0 + r4] ; m1 = row 7 + punpckhwd xm0, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm0, 1 + pmaddwd m0, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m0 + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m8 + paddd m5, m8 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r0 + r1] ; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd 
m5, m0, [r5 + 1 * mmsize] + paddd m6, m5 + pmaddwd m0, [r5] + movu xm5, [r0 + r1 * 2] ; m5 = row 10 + punpckhwd xm7, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm7, 1 + pmaddwd m7, m2, [r5 + 1 * mmsize] + paddd m1, m7 + pmaddwd m2, [r5] + +%ifidn %1,sp + paddd m6, m8 + paddd m1, m8 + psrad m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vextracti128 xm6, m4, 1 + movd [r2], xm4 + pextrw [r2 + 4], xm6, 0 + pextrd [r2 + r3], xm4, 1 + pextrw [r2 + r3 + 4], xm6, 2 + pextrd [r2 + r3 * 2], xm4, 2 + pextrw [r2 + r3 * 2 + 4], xm6, 4 + pextrd [r2 + r6], xm4, 3 + pextrw [r2 + r6 + 4], xm6, 6 +%else + movq [r2], xm4 + movhps [r2 + r3], xm4 + movq [r2 + r3 * 2], xm6 + movhps [r2 + r6], xm6 + vextracti128 xm4, m4, 1 + vextracti128 xm1, m6, 1 + movd [r2 + 8], xm4 + pextrd [r2 + r3 + 8], xm4, 2 + movd [r2 + r3 * 2 + 8], xm1 + pextrd [r2 + r6 + 8], xm1, 2 +%endif + lea r2, [r2 + r3 * 4] + movu xm7, [r0 + r4] ; m7 = row 11 + punpckhwd xm1, xm5, xm7 + punpcklwd xm5, xm7 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + paddd m0, m1 + pmaddwd m5, [r5] + lea r0, [r0 + r1 * 4] + movu xm1, [r0] ; m1 = row 12 + punpckhwd xm4, xm7, xm1 + punpcklwd xm7, xm1 + vinserti128 m7, m7, xm4, 1 + pmaddwd m4, m7, [r5 + 1 * mmsize] + paddd m2, m4 + pmaddwd m7, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m2, m8 + psrad m0, 12 + psrad m2, 12 +%else + psrad m0, 6 + psrad m2, 6 +%endif + packssdw m0, m2 + + movu xm4, [r0 + r1] ; m4 = row 13 + punpckhwd xm2, xm1, xm4 + punpcklwd xm1, xm4 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + paddd m5, m2 + pmaddwd m1, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 14 + punpckhwd xm6, xm4, xm2 + punpcklwd xm4, xm2 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m7, m6 + pmaddwd m4, [r5] +%ifidn %1,sp + paddd m5, m8 + paddd m7, m8 + psrad m5, 12 + psrad m7, 12 +%else + psrad m5, 6 + psrad m7, 6 +%endif + packssdw m5, m7 +%ifidn %1,sp + 
packuswb m0, m5 + vextracti128 xm5, m0, 1 + movd [r2], xm0 + pextrw [r2 + 4], xm5, 0 + pextrd [r2 + r3], xm0, 1 + pextrw [r2 + r3 + 4], xm5, 2 + pextrd [r2 + r3 * 2], xm0, 2 + pextrw [r2 + r3 * 2 + 4], xm5, 4 + pextrd [r2 + r6], xm0, 3 + pextrw [r2 + r6 + 4], xm5, 6 +%else + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm5 + vextracti128 xm0, m0, 1 + vextracti128 xm7, m5, 1 + movd [r2 + 8], xm0 + pextrd [r2 + r3 + 8], xm0, 2 + movd [r2 + r3 * 2 + 8], xm7 + pextrd [r2 + r6 + 8], xm7, 2 +%endif + lea r2, [r2 + r3 * 4] + + movu xm6, [r0 + r4] ; m6 = row 15 + punpckhwd xm5, xm2, xm6 + punpcklwd xm2, xm6 + vinserti128 m2, m2, xm5, 1 + pmaddwd m5, m2, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 16 + punpckhwd xm5, xm6, xm0 + punpcklwd xm6, xm0 + vinserti128 m6, m6, xm5, 1 + pmaddwd m5, m6, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m6, [r5] +%ifidn %1,sp + paddd m1, m8 + paddd m4, m8 + psrad m1, 12 + psrad m4, 12 +%else + psrad m1, 6 + psrad m4, 6 +%endif + packssdw m1, m4 + + movu xm5, [r0 + r1] ; m5 = row 17 + punpckhwd xm4, xm0, xm5 + punpcklwd xm0, xm5 + vinserti128 m0, m0, xm4, 1 + pmaddwd m0, [r5 + 1 * mmsize] + paddd m2, m0 + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhwd xm0, xm5, xm4 + punpcklwd xm5, xm4 + vinserti128 m5, m5, xm0, 1 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m6, m5 +%ifidn %1,sp + paddd m2, m8 + paddd m6, m8 + psrad m2, 12 + psrad m6, 12 +%else + psrad m2, 6 + psrad m6, 6 +%endif + packssdw m2, m6 +%ifidn %1,sp + packuswb m1, m2 + vextracti128 xm2, m1, 1 + movd [r2], xm1 + pextrw [r2 + 4], xm2, 0 + pextrd [r2 + r3], xm1, 1 + pextrw [r2 + r3 + 4], xm2, 2 + pextrd [r2 + r3 * 2], xm1, 2 + pextrw [r2 + r3 * 2 + 4], xm2, 4 + pextrd [r2 + r6], xm1, 3 + pextrw [r2 + r6 + 4], xm2, 6 +%else + movq [r2], xm1 + movhps [r2 + r3], xm1 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 + vextracti128 xm4, m1, 1 + vextracti128 xm6, m2, 1 + movd [r2 + 8], xm4 + 
pextrd [r2 + r3 + 8], xm4, 2 + movd [r2 + r3 * 2 + 8], xm6 + pextrd [r2 + r6 + 8], xm6, 2 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_6x16 sp + FILTER_VER_CHROMA_S_AVX2_6x16 ss + +;--------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vertical_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SS_W2_4R 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ss_%1x%2, 5, 6, 5 + + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mov r4d, (%2/4) + +.loopH: + PROCESS_CHROMA_SP_W2_4R r5 + + psrad m0, 6 + psrad m2, 6 + + packssdw m0, m2 + + movd [r2], m0 + pextrd [r2 + r3], m0, 1 + lea r2, [r2 + 2 * r3] + pextrd [r2], m0, 2 + pextrd [r2 + r3], m0, 3 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SS_W2_4R 2, 4 + FILTER_VER_CHROMA_SS_W2_4R 2, 8 + + FILTER_VER_CHROMA_SS_W2_4R 2, 16 + +;--------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ss_4x2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;--------------------------------------------------------------------------------------------------------------- +INIT_XMM sse2 +cglobal interp_4tap_vert_ss_4x2, 5, 6, 4 + + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + movq m0, [r0] + movq m1, [r0 + r1] + punpcklwd m0, m1 ;m0=[0 1] + pmaddwd m0, [r5 + 0 *16] ;m0=[0+1] Row1 + + lea r0, [r0 + 2 * r1] + movq m2, [r0] + punpcklwd m1, m2 ;m1=[1 2] + pmaddwd m1, [r5 + 0 
*16] ;m1=[1+2] Row2 + + movq m3, [r0 + r1] + punpcklwd m2, m3 ;m4=[2 3] + pmaddwd m2, [r5 + 1 * 16] + paddd m0, m2 ;m0=[0+1+2+3] Row1 done + psrad m0, 6 + + movq m2, [r0 + 2 * r1] + punpcklwd m3, m2 ;m5=[3 4] + pmaddwd m3, [r5 + 1 * 16] + paddd m1, m3 ;m1=[1+2+3+4] Row2 done + psrad m1, 6 + + packssdw m0, m1 + + movlps [r2], m0 + movhps [r2 + r3], m0 + + RET + +;------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vertical_ss_6x8(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SS_W6_H4 2 +INIT_XMM sse4 +cglobal interp_4tap_vert_ss_6x%2, 5, 7, 6 + + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r6, [r5 + r4] +%else + lea r6, [tab_ChromaCoeffV + r4] +%endif + + mov r4d, %2/4 + +.loopH: + PROCESS_CHROMA_SP_W4_4R + + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + movlps [r2], m0 + movhps [r2 + r3], m0 + lea r5, [r2 + 2 * r3] + movlps [r5], m2 + movhps [r5 + r3], m2 + + lea r5, [4 * r1 - 2 * 4] + sub r0, r5 + add r2, 2 * 4 + + PROCESS_CHROMA_SP_W2_4R r6 + + psrad m0, 6 + psrad m2, 6 + + packssdw m0, m2 + + movd [r2], m0 + pextrd [r2 + r3], m0, 1 + lea r2, [r2 + 2 * r3] + pextrd [r2], m0, 2 + pextrd [r2 + r3], m0, 3 + + sub r0, 2 * 4 + lea r2, [r2 + 2 * r3 - 2 * 4] + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SS_W6_H4 6, 8 + + FILTER_VER_CHROMA_SS_W6_H4 6, 16 + + +;---------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert_ss_8x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) 
+;---------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_SS_W8_H2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_ss_%1x%2, 5, 6, 7 + + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 5 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mov r4d, %2/2 +.loopH: + PROCESS_CHROMA_SP_W8_2R + + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + movu [r2], m0 + movu [r2 + r3], m2 + + lea r2, [r2 + 2 * r3] + + dec r4d + jnz .loopH + + RET +%endmacro + + FILTER_VER_CHROMA_SS_W8_H2 8, 2 + FILTER_VER_CHROMA_SS_W8_H2 8, 4 + FILTER_VER_CHROMA_SS_W8_H2 8, 6 + FILTER_VER_CHROMA_SS_W8_H2 8, 8 + FILTER_VER_CHROMA_SS_W8_H2 8, 16 + FILTER_VER_CHROMA_SS_W8_H2 8, 32 + + FILTER_VER_CHROMA_SS_W8_H2 8, 12 + FILTER_VER_CHROMA_SS_W8_H2 8, 64 + ;----------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------------------------------------------- @@ -9353,3 +25464,2816 @@ FILTER_VER_LUMA_S_AVX2_32x24 sp FILTER_VER_LUMA_S_AVX2_32x24 ss + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_32x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_32x32, 4,6,8 + mov r4d, r4m + add r3d, r3d + dec r0 + + ; check isRowExt + cmp r5m, byte 0 + + lea r5, [tab_ChromaCoeff] + vpbroadcastw m0, [r5 + r4 * 4 + 0] + vpbroadcastw m1, [r5 + r4 * 4 + 2] + mova m7, 
[pw_2000] + + ; register map + ; m0 - interpolate coeff Low + ; m1 - interpolate coeff High + ; m7 - constant pw_2000 + mov r4d, 32 + je .loop + sub r0, r1 + add r4d, 3 + +.loop + ; Row 0 + movu m2, [r0] + movu m3, [r0 + 1] + punpckhbw m4, m2, m3 + punpcklbw m2, m3 + pmaddubsw m4, m0 + pmaddubsw m2, m0 + + movu m3, [r0 + 2] + movu m5, [r0 + 3] + punpckhbw m6, m3, m5 + punpcklbw m3, m5 + pmaddubsw m6, m1 + pmaddubsw m3, m1 + + paddw m4, m6 + paddw m2, m3 + psubw m4, m7 + psubw m2, m7 + vperm2i128 m3, m2, m4, 0x20 + vperm2i128 m5, m2, m4, 0x31 + movu [r2], m3 + movu [r2 + mmsize], m5 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_16x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_16x16, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, 16 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2], m3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET + 
+;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_16xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PS_16xN_AVX2 2 +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, %2 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + + vpermq m3, m3, 11011000b + movu [r2], m3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_PS_16xN_AVX2 16 , 32 + IPFILTER_CHROMA_PS_16xN_AVX2 16 , 12 + IPFILTER_CHROMA_PS_16xN_AVX2 16 , 8 + IPFILTER_CHROMA_PS_16xN_AVX2 16 , 4 + IPFILTER_CHROMA_PS_16xN_AVX2 16 , 24 + IPFILTER_CHROMA_PS_16xN_AVX2 16 , 64 + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_32xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PS_32xN_AVX2 2 +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add 
r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, %2 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + + vpermq m3, m3, 11011000b + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 24] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + + vpermq m3, m3, 11011000b + movu [r2 + 32], m3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 16 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 24 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 8 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 64 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 48 +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_4x4, 4,7,5 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + test r5d, r5d + je .label + 
sub r0 , r1 + +.label + ; Row 0-1 + movu xm3, [r0] + vinserti128 m3, m3, [r0 + r1], 1 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 2-3 + lea r0, [r0 + r1 * 2] + movu xm4, [r0] + vinserti128 m4, m4, [r0 + r1], 1 + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, [pw_2000] + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movq [r2+r3], xm4 + lea r2, [r2 + r3 * 2] + movhps [r2], xm3 + movhps [r2 + r3], xm4 + + test r5d, r5d + jz .end + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + + ;Row 5-6 + movu xm3, [r0] + vinserti128 m3, m3, [r0 + r1], 1 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 7 + lea r0, [r0 + r1 * 2] + vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, [pw_2000] + + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movq [r2+r3], xm4 + lea r2, [r2 + r3 * 2] + movhps [r2], xm3 +.end + RET + +cglobal interp_4tap_horiz_ps_4x2, 4,7,5 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + test r5d, r5d + je .label + sub r0 , r1 + +.label + ; Row 0-1 + movu xm3, [r0] + vinserti128 m3, m3, [r0 + r1], 1 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + packssdw m3, m3 + psubw m3, [pw_2000] + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movq [r2+r3], xm4 + + test r5d, r5d + jz .end + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + + ;Row 2-3 + movu xm3, [r0] + vinserti128 m3, m3, [r0 + r1], 1 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 5 + lea r0, [r0 + r1 * 2] + vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, 
[pw_2000] + + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movq [r2+r3], xm4 + lea r2, [r2 + r3 * 2] + movhps [r2], xm3 +.end + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; +%macro IPFILTER_CHROMA_PS_4xN_AVX2 2 +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_%1x%2, 4,7,5 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + mov r4, %2 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + + +.loop + sub r4d, 4 + ; Row 0-1 + movu xm3, [r0] + vinserti128 m3, m3, [r0 + r1], 1 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 2-3 + lea r0, [r0 + r1 * 2] + movu xm4, [r0] + vinserti128 m4, m4, [r0 + r1], 1 + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, [pw_2000] + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movq [r2+r3], xm4 + lea r2, [r2 + r3 * 2] + movhps [r2], xm3 + movhps [r2 + r3], xm4 + + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + + test r4d, r4d + jnz .loop + test r5d, r5d + jz .end + + ;Row 5-6 + movu xm3, [r0] + vinserti128 m3, m3, [r0 + r1], 1 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 7 + lea r0, [r0 + r1 * 2] + vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, [pw_2000] + + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movq [r2+r3], xm4 + lea r2, [r2 + r3 * 2] + movhps [r2], xm3 
+.end + RET +%endmacro + + IPFILTER_CHROMA_PS_4xN_AVX2 4 , 8 + IPFILTER_CHROMA_PS_4xN_AVX2 4 , 16 +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_8x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_8x8, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + mov r6d, 4 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 1 + +.loop + dec r6d + ; Row 0 + vbroadcasti128 m3, [r0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + + vpermq m3, m3, 11011000b + vextracti128 xm4, m3, 1 + movu [r2], xm3 + movu [r2 + r3], xm4 + + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + test r6d, r6d + jnz .loop + test r5d, r5d + je .end + + ;Row 11 + vbroadcasti128 m3, [r0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + packssdw m3, m3 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2], xm3 +.end + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_4x2, 4,6,4 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + + ; Row 0-1 + movu xm2, [r0 - 1] + vinserti128 m2, m2, [r0 + r1 - 1], 1 + pshufb m2, m1 
+ pmaddubsw m2, m0 + pmaddwd m2, [pw_1] + + packssdw m2, m2 + pmulhrsw m2, [pw_512] + vextracti128 xm3, m2, 1 + packuswb xm2, xm3 + + movd [r2], xm2 + pextrd [r2+r3], xm2, 2 + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_32xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PP_32xN_AVX2 2 +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_%1x%2, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, %2 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + movu [r2], m3 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_PP_32xN_AVX2 32, 16 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 24 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 8 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 64 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 48 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx 
+;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PP_8xN_AVX2 2 +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_%1x%2, 4,6,6 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + movu m1, [tab_Tm] + vpbroadcastd m2, [pw_1] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + sub r0, 1 + mov r4d, %2 + +.loop: + sub r4d, 4 + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, [pw_512] + lea r0, [r0 + r1 * 2] + + ; Row 2 + vbroadcasti128 m4, [r0 ] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + ; Row 3 + vbroadcasti128 m5, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, [pw_512] + + packuswb m3, m4 + mova m5, [interp_4tap_8x8_horiz_shuf] + vpermd m3, m5, m3 + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movhps [r2 + r3], xm3 + lea r2, [r2 + r3 * 2] + movq [r2], xm4 + movhps [r2 + r3], xm4 + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1*2] + test r4d, r4d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 16 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 32 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 4 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 64 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 12 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx 
+;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PP_4xN_AVX2 2 +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_%1x%2, 4,6,6 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vpbroadcastd m2, [pw_1] + vbroadcasti128 m1, [tab_Tm] + mov r4d, %2 + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + +.loop + sub r4d, 4 + ; Row 0-1 + movu xm3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + vinserti128 m3, m3, [r0 + r1], 1 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 2-3 + lea r0, [r0 + r1 * 2] + movu xm4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + vinserti128 m4, m4, [r0 + r1], 1 + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + pmulhrsw m3, [pw_512] + vextracti128 xm4, m3, 1 + packuswb xm3, xm4 + + movd [r2], xm3 + pextrd [r2+r3], xm3, 2 + lea r2, [r2 + r3 * 2] + pextrd [r2], xm3, 1 + pextrd [r2+r3], xm3, 3 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + test r4d, r4d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_PP_4xN_AVX2 4 , 8 + IPFILTER_CHROMA_PP_4xN_AVX2 4 , 16 + +%macro IPFILTER_LUMA_PS_32xN_AVX2 2 +INIT_YMM avx2 +cglobal interp_8tap_horiz_ps_%1x%2, 4, 7, 8 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + mov r4d, %2 ;height + add r3d, r3d + vbroadcasti128 m2, [pw_1] + mova m7, [interp8_hps_shuf] + + ; register map + ; m0 - interpolate coeff + ; m1 , m6 - shuffle order table + ; m2 - pw_1 + + + sub r0, 3 + test r5d, r5d + jz .label + lea r6, [r1 * 3] ; r8 = (N / 2 - 1) * srcStride + sub r0, r6 + add r4d, 7 + +.label + lea r6, [pw_2000] +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb 
m4, m3, m6 ; row 0 (col 4 to 7) + pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + + + vbroadcasti128 m4, [r0 + 8] + pshufb m5, m4, m6 ;row 0 (col 12 to 15) + pshufb m4, m1 ;row 0 (col 8 to 11) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m2 + pmaddwd m5, m2 + packssdw m4, m5 + + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + vpermd m3, m7, m3 + psubw m3, [r6] + + movu [r2], m3 ;row 0 + + vbroadcasti128 m3, [r0 + 16] + pshufb m4, m3, m6 ; row 0 (col 20 to 23) + pshufb m3, m1 ; row 0 (col 16 to 19) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 24] + pshufb m5, m4, m6 ;row 0 (col 28 to 31) + pshufb m4, m1 ;row 0 (col 24 to 27) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m2 + pmaddwd m5, m2 + packssdw m4, m5 + + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + vpermd m3, m7, m3 + psubw m3, [r6] + + movu [r2 + 32], m3 ;row 0 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endmacro + + IPFILTER_LUMA_PS_32xN_AVX2 32 , 32 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 16 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 24 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 8 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 64 + +INIT_YMM avx2 +cglobal interp_8tap_horiz_ps_48x64, 4, 7, 8 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + mov r4d, 64 ;height + add r3d, r3d + vbroadcasti128 m2, [pw_2000] + mova m7, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 , m6 - shuffle order table + ; m2 - pw_2000 + + sub r0, 3 + test r5d, r5d + jz .label + lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride + sub r0, r6 ; r0(src)-r6 + add r4d, 7 ; blkheight += N - 1 (7 - 1 = 6 ; since the last one row not in loop) + +.label + lea r6, [interp8_hps_shuf] +.loop + ; 
Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 ; row 0 (col 4 to 7) + pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 8] + pshufb m5, m4, m6 ;row 0 (col 12 to 15) + pshufb m4, m1 ;row 0 (col 8 to 11) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m7 + pmaddwd m5, m7 + packssdw m4, m5 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + mova m5, [r6] + vpermd m3, m5, m3 + psubw m3, m2 + movu [r2], m3 ;row 0 + + vbroadcasti128 m3, [r0 + 16] + pshufb m4, m3, m6 ; row 0 (col 20 to 23) + pshufb m3, m1 ; row 0 (col 16 to 19) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 24] + pshufb m5, m4, m6 ;row 0 (col 28 to 31) + pshufb m4, m1 ;row 0 (col 24 to 27) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m7 + pmaddwd m5, m7 + packssdw m4, m5 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + mova m5, [r6] + vpermd m3, m5, m3 + psubw m3, m2 + movu [r2 + 32], m3 ;row 0 + + vbroadcasti128 m3, [r0 + 32] + pshufb m4, m3, m6 ; row 0 (col 36 to 39) + pshufb m3, m1 ; row 0 (col 32 to 35) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 40] + pshufb m5, m4, m6 ;row 0 (col 44 to 47) + pshufb m4, m1 ;row 0 (col 40 to 43) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m7 + pmaddwd m5, m7 + packssdw m4, m5 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + mova m5, [r6] + vpermd m3, m5, m3 + psubw m3, m2 + movu [r2 + 64], m3 ;row 0 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_24x32, 4,6,8 + sub r0, 3 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd 
m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + movu m3, [tab_Tm + 16] + vpbroadcastd m7, [pw_1] + lea r5, [tab_Tm] + + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m2 shuffle order table + ; m7 - pw_1 + + mov r4d, 32 +.loop: + ; Row 0 + vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m3 + pshufb m4, [r5] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + paddw m4, m5 + pmaddwd m4, m7 + + vbroadcasti128 m5, [r0 + 8] + pshufb m6, m5, m3 + pshufb m5, [r5] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] + pmulhrsw m4, [pw_512] + + vbroadcasti128 m2, [r0 + 16] + pshufb m5, m2, m3 + pshufb m2, [r5] + pmaddubsw m2, m0 + pmaddubsw m5, m1 + paddw m2, m5 + pmaddwd m2, m7 + + packssdw m2, m2 + pmulhrsw m2, [pw_512] + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm5, m4, 1 + pshufd xm4, xm4, 11011000b + pshufd xm5, xm5, 11011000b + + movu [r2], xm4 + movq [r2 + 16], xm5 + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_12x16, 4,6,8 + sub r0, 3 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + movu m3, [tab_Tm + 16] + vpbroadcastd m7, [pw_1] + lea r5, [tab_Tm] + + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m2 shuffle order table + ; m7 - pw_1 + + mov r4d, 8 +.loop: + ; Row 0 + vbroadcasti128 m4, [r0] ;first 8 element + pshufb m5, m4, m3 + pshufb m4, [r5] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + paddw m4, m5 + pmaddwd m4, m7 + + vbroadcasti128 m5, [r0 + 8] ; element 8 to 11 + pshufb m6, m5, m3 + pshufb m5, [r5] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + + packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] + pmulhrsw m4, [pw_512] + + ;Row 1 + vbroadcasti128 m2, [r0 + r1] 
+ pshufb m5, m2, m3 + pshufb m2, [r5] + pmaddubsw m2, m0 + pmaddubsw m5, m1 + paddw m2, m5 + pmaddwd m2, m7 + + vbroadcasti128 m5, [r0 + r1 + 8] + pshufb m6, m5, m3 + pshufb m5, [r5] + pmaddubsw m5, m0 + pmaddubsw m6, m1 + paddw m5, m6 + pmaddwd m5, m7 + + packssdw m2, m5 + pmulhrsw m2, [pw_512] + packuswb m4, m2 + vpermq m4, m4, 11011000b + vextracti128 xm5, m4, 1 + pshufd xm4, xm4, 11011000b + pshufd xm5, xm5, 11011000b + + movq [r2], xm4 + pextrd [r2+8], xm4, 2 + movq [r2 + r3], xm5 + pextrd [r2+r3+8], xm5, 2 + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + dec r4d + jnz .loop + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_16xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PP_16xN_AVX2 2 +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m6, [pw_512] + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, %2/2 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq 
m3, m3, 11011000b + + vextracti128 xm4, m3, 1 + movu [r2], xm3 + movu [r2 + r3], xm4 + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + dec r4d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 64 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 24 + +%macro IPFILTER_LUMA_PS_64xN_AVX2 1 +INIT_YMM avx2 +cglobal interp_8tap_horiz_ps_64x%1, 4, 7, 8 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + mov r4d, %1 ;height + add r3d, r3d + vbroadcasti128 m2, [pw_1] + mova m7, [interp8_hps_shuf] + + ; register map + ; m0 - interpolate coeff + ; m1 , m6 - shuffle order table + ; m2 - pw_2000 + + sub r0, 3 + test r5d, r5d + jz .label + lea r6, [r1 * 3] + sub r0, r6 ; r0(src)-r6 + add r4d, 7 ; blkheight += N - 1 + +.label + lea r6, [pw_2000] +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 ; row 0 (col 4 to 7) + pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 8] + pshufb m5, m4, m6 ;row 0 (col 12 to 15) + pshufb m4, m1 ;row 0 (col 8 to 11) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m2 + pmaddwd m5, m2 + packssdw m4, m5 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + vpermd m3, m7, m3 + psubw m3, [r6] + movu [r2], m3 ;row 0 + + vbroadcasti128 m3, [r0 + 16] + pshufb m4, m3, m6 ; row 0 (col 20 to 23) + pshufb m3, m1 ; row 0 (col 16 to 19) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 24] + pshufb m5, m4, m6 ;row 0 (col 28 to 31) + pshufb m4, m1 ;row 0 (col 24 to 27) + pmaddubsw m4, m0 + 
pmaddubsw m5, m0 + pmaddwd m4, m2 + pmaddwd m5, m2 + packssdw m4, m5 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + vpermd m3, m7, m3 + psubw m3, [r6] + movu [r2 + 32], m3 ;row 0 + + vbroadcasti128 m3, [r0 + 32] + pshufb m4, m3, m6 ; row 0 (col 36 to 39) + pshufb m3, m1 ; row 0 (col 32 to 35) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 40] + pshufb m5, m4, m6 ;row 0 (col 44 to 47) + pshufb m4, m1 ;row 0 (col 40 to 43) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m2 + pmaddwd m5, m2 + packssdw m4, m5 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + vpermd m3, m7, m3 + psubw m3, [r6] + movu [r2 + 64], m3 ;row 0 + vbroadcasti128 m3, [r0 + 48] + pshufb m4, m3, m6 ; row 0 (col 52 to 55) + pshufb m3, m1 ; row 0 (col 48 to 51) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 56] + pshufb m5, m4, m6 ;row 0 (col 60 to 63) + pshufb m4, m1 ;row 0 (col 56 to 59) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m2 + pmaddwd m5, m2 + packssdw m4, m5 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 + vpermd m3, m7, m3 + psubw m3, [r6] + movu [r2 + 96], m3 ;row 0 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endmacro + + IPFILTER_LUMA_PS_64xN_AVX2 64 + IPFILTER_LUMA_PS_64xN_AVX2 48 + IPFILTER_LUMA_PS_64xN_AVX2 32 + IPFILTER_LUMA_PS_64xN_AVX2 16 + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PS_8xN_AVX2 1 +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_8x%1, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + 
vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + mov r6d, %1/2 + dec r0 + test r5d, r5d + jz .loop + sub r0 , r1 + inc r6d + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + vextracti128 xm4, m3, 1 + movu [r2], xm3 + movu [r2 + r3], xm4 + + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + dec r6d + jnz .loop + test r5d, r5d + jz .end + + ;Row 11 + vbroadcasti128 m3, [r0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + packssdw m3, m3 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2], xm3 +.end + RET +%endmacro + + IPFILTER_CHROMA_PS_8xN_AVX2 2 + IPFILTER_CHROMA_PS_8xN_AVX2 32 + IPFILTER_CHROMA_PS_8xN_AVX2 16 + IPFILTER_CHROMA_PS_8xN_AVX2 6 + IPFILTER_CHROMA_PS_8xN_AVX2 4 + IPFILTER_CHROMA_PS_8xN_AVX2 12 + IPFILTER_CHROMA_PS_8xN_AVX2 64 + +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_2x4, 4, 7, 3 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova xm3, [pw_2000] + dec r0 + test r5d, r5d + jz .label + sub r0, r1 + +.label + lea r6, [r1 * 3] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r6] + + vinserti128 m1, m1, xm2, 1 + pshufb m1, [interp4_hpp_shuf] + pmaddubsw m1, m0 + pmaddwd m1, [pw_1] + vextracti128 xm2, m1, 1 + packssdw xm1, xm2 + psubw xm1, xm3 + + lea r4, [r3 * 3] + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + pextrd [r2 + r3 * 2], xm1, 2 + pextrd [r2 + r4], xm1, 3 + + test r5d, r5d + jz .end + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] + + movq xm1, 
[r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + vinserti128 m1, m1, xm2, 1 + pshufb m1, [interp4_hpp_shuf] + pmaddubsw m1, m0 + pmaddwd m1, [pw_1] + vextracti128 xm2, m1, 1 + packssdw xm1, xm2 + psubw xm1, xm3 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + pextrd [r2 + r3 * 2], xm1, 2 +.end + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_2x8, 4, 7, 7 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + vbroadcasti128 m6, [pw_2000] + test r5d, r5d + jz .label + sub r0, r1 + +.label + mova m4, [interp4_hpp_shuf] + mova m5, [pw_1] + dec r0 + lea r4, [r1 * 3] + movq xm1, [r0] ;row 0 + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + psubw m1, m6 + + lea r4, [r3 * 3] + vextracti128 xm2, m1, 1 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + movd [r2 + r3 * 2], xm2 + pextrd [r2 + r4], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm1, 2 + pextrd [r2 + r3], xm1, 3 + pextrd [r2 + r3 * 2], xm2, 2 + pextrd [r2 + r4], xm2, 3 + test r5d, r5d + jz .end + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + movq xm1, [r0] ;row 0 + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + vinserti128 m1, m1, xm2, 1 + pshufb m1, m4 + pmaddubsw m1, m0 + pmaddwd m1, m5 + packssdw m1, m1 + psubw m1, m6 + vextracti128 xm2, m1, 1 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + movd [r2 + r3 * 2], xm2 +.end + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_12x16, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m6, 
[pw_512] + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 8 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + vextracti128 xm4, m3, 1 + movq [r2], xm3 + pextrd [r2+8], xm3, 2 + movq [r2 + r3], xm4 + pextrd [r2 + r3 + 8],xm4, 2 + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_24x32, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 32 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + vextracti128 xm4, m3, 1 + movu [r2], xm3 + movq [r2 + 16], xm4 
+ add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_6x8, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + mov r6d, 8/2 + dec r0 + test r5d, r5d + jz .loop + sub r0 , r1 + inc r6d + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + vextracti128 xm4, m3, 1 + movq [r2], xm3 + pextrd [r2 + 8], xm3, 2 + movq [r2 + r3], xm4 + pextrd [r2 + r3 + 8], xm4, 2 + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + dec r6d + jnz .loop + test r5d, r5d + jz .end + + ;Row 11 + vbroadcasti128 m3, [r0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + packssdw m3, m3 + psubw m3, m5 + vextracti128 xm4, m3, 1 + movq [r2], xm3 + movd [r2+8], xm4 +.end + RET + +INIT_YMM avx2 +cglobal interp_8tap_horiz_ps_12x16, 6, 7, 8 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + add r3d, r3d + vbroadcasti128 m2, [pw_2000] + mov r4d, 16 + vbroadcasti128 m7, [pw_1] + ; register map + ; m0 - interpolate 
coeff + ; m1 - shuffle order table + ; m2 - pw_2000 + + mova m5, [interp8_hps_shuf] + sub r0, 3 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride + sub r0, r6 ; r0(src)-r6 + add r4d, 7 +.loop + + ; Row 0 + + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 + pshufb m3, m1 ; shuffled based on the col order tab_Lm + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m7 + packssdw m4, m4 + + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vpermd m3, m5, m3 + psubw m3, m2 + + vextracti128 xm4, m3, 1 + movu [r2], xm3 ;row 0 + movq [r2 + 16], xm4 ;row 1 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_8tap_horiz_ps_24x32, 4, 7, 8 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + mov r4d, 32 ;height + add r3d, r3d + vbroadcasti128 m2, [pw_2000] + vbroadcasti128 m7, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 , m6 - shuffle order table + ; m2 - pw_2000 + + sub r0, 3 + test r5d, r5d + jz .label + lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride + sub r0, r6 ; r0(src)-r6 + add r4d, 7 ; blkheight += N - 1 (7 - 1 = 6 ; since the last one row not in loop) + +.label + lea r6, [interp8_hps_shuf] +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 ; row 0 (col 4 to 7) + pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m6 ;row 1 (col 4 to 7) + pshufb m4, m1 ;row 1 (col 0 to 3) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + 
pmaddwd m4, m7 + pmaddwd m5, m7 + packssdw m4, m5 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + mova m5, [r6] + vpermd m3, m5, m3 + psubw m3, m2 + movu [r2], m3 ;row 0 + + vbroadcasti128 m3, [r0 + 16] + pshufb m4, m3, m6 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + mova m4, [r6] + vpermd m3, m4, m3 + psubw m3, m2 + movu [r2 + 32], xm3 ;row 0 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_24x32, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, 32 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + packssdw m3, m3 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2 + 32], xm3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET + 
+;----------------------------------------------------------------------------------------------------------------------- +;macro FILTER_H8_W8_16N_AVX2 +;----------------------------------------------------------------------------------------------------------------------- +%macro FILTER_H8_W8_16N_AVX2 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 ; row 0 (col 4 to 7) + pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m6 ;row 1 (col 4 to 7) + pshufb m4, m1 ;row 1 (col 0 to 3) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m2 + pmaddwd m5, m2 + packssdw m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 ; all rows and col completed. + + mova m5, [interp8_hps_shuf] + vpermd m3, m5, m3 + psubw m3, m8 + + vextracti128 xm4, m3, 1 + mova [r4], xm3 + mova [r4 + 16], xm4 + %endmacro + +;----------------------------------------------------------------------------- +; void interp_8tap_hv_pp_16x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY) +;----------------------------------------------------------------------------- +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_hv_pp_16x16, 4, 10, 15, 0-31*32 +%define stk_buf1 rsp + mov r4d, r4m + mov r5d, r5m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + + xor r6, r6 + mov r4, rsp + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + mov r8, 16 ;height + vbroadcasti128 m8, [pw_2000] + vbroadcasti128 m2, [pw_1] + sub r0, 3 + lea r7, [r1 * 3] ; r7 = (N / 2 - 1) * srcStride + sub r0, r7 ; r0(src)-r7 + add r8, 7 + +.loopH: + FILTER_H8_W8_16N_AVX2 + add r0, r1 + add r4, 32 + inc r6 + 
cmp r6, 16+7 + jnz .loopH + +; vertical phase + xor r6, r6 + xor r1, r1 +.loopV: + +;load necessary variables + mov r4d, r5d ;coeff here for vertical is r5m + shl r4d, 7 + mov r1d, 16 + add r1d, r1d + + ; load intermedia buffer + mov r0, stk_buf1 + + ; register mapping + ; r0 - src + ; r5 - coeff + ; r6 - loop_i + +; load coeff table +%ifdef PIC + lea r5, [pw_LumaCoeffVer] + add r5, r4 +%else + lea r5, [pw_LumaCoeffVer + r4] +%endif + + lea r4, [r1*3] + mova m14, [pd_526336] + lea r6, [r3 * 3] + mov r9d, 16 / 8 + +.loopW: + PROCESS_LUMA_AVX2_W8_16R sp + add r2, 8 + add r0, 16 + dec r9d + jnz .loopW + RET +%endif + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_12x32, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m6, [pw_512] + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 16 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + vextracti128 xm4, m3, 1 + movq [r2], xm3 + pextrd [r2+8], xm3, 2 + movq [r2 + r3], xm4 + pextrd [r2 + r3 + 8],xm4, 2 + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_24x64, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, 
[r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 64 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + vextracti128 xm4, m3, 1 + movu [r2], xm3 + movq [r2 + 16], xm4 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m4, [interp4_hpp_shuf] + mova m5, [pw_1] + dec r0 + lea r4, [r1 * 3] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + pmulhrsw m1, [pw_512] + vextracti128 xm2, m1, 1 + packuswb xm1, xm2 + + lea r4, [r3 * 3] + pextrw [r2], xm1, 0 + pextrw [r2 + r3], xm1, 1 + pextrw [r2 + r3 * 2], xm1, 4 + pextrw [r2 + r4], xm1, 5 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm1, 2 + pextrw [r2 + r3], xm1, 3 + pextrw [r2 + r3 * 2], xm1, 6 + pextrw [r2 + r4], xm1, 7 + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] + + lea r4, [r1 * 3] + movq xm1, [r0] + 
movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + pmulhrsw m1, [pw_512] + vextracti128 xm2, m1, 1 + packuswb xm1, xm2 + + lea r4, [r3 * 3] + pextrw [r2], xm1, 0 + pextrw [r2 + r3], xm1, 1 + pextrw [r2 + r3 * 2], xm1, 4 + pextrw [r2 + r4], xm1, 5 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm1, 2 + pextrw [r2 + r3], xm1, 3 + pextrw [r2 + r3 * 2], xm1, 6 + pextrw [r2 + r4], xm1, 7 + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PP_64xN_AVX2 1 +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_64x%1, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, %1 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + packuswb m3, m4 + vpermq m3, m3, 11011000b + 
movu [r2], m3 + + vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 36] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 48] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 52] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + packuswb m3, m4 + vpermq m3, m3, 11011000b + movu [r2 + 32], m3 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_PP_64xN_AVX2 64 + IPFILTER_CHROMA_PP_64xN_AVX2 32 + IPFILTER_CHROMA_PP_64xN_AVX2 48 + IPFILTER_CHROMA_PP_64xN_AVX2 16 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_48x64, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 64 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, q3120 + + movu [r2], m3 
+ + vbroadcasti128 m3, [r0 + mmsize] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + mmsize + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + mmsize + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + mmsize + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, q3120 + movu [r2 + mmsize], xm3 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; + +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_48x64, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, 64 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] 
+ pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2 + 32], m3 + + vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 40] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2 + 64], m3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_24x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_24x64, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, 64 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + packssdw m3, m3 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2 + 32], xm3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET + +INIT_YMM avx2 +cglobal 
interp_4tap_horiz_ps_2x16, 4, 7, 7 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + vbroadcasti128 m6, [pw_2000] + test r5d, r5d + jz .label + sub r0, r1 + +.label + mova m4, [interp4_hps_shuf] + mova m5, [pw_1] + dec r0 + lea r4, [r1 * 3] + movq xm1, [r0] ;row 0 + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + psubw m1, m6 + + lea r4, [r3 * 3] + vextracti128 xm2, m1, 1 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + movd [r2 + r3 * 2], xm2 + pextrd [r2 + r4], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm1, 2 + pextrd [r2 + r3], xm1, 3 + pextrd [r2 + r3 * 2], xm2, 2 + pextrd [r2 + r4], xm2, 3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + lea r4, [r1 * 3] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + psubw m1, m6 + + lea r4, [r3 * 3] + vextracti128 xm2, m1, 1 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + movd [r2 + r3 * 2], xm2 + pextrd [r2 + r4], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm1, 2 + pextrd [r2 + r3], xm1, 3 + pextrd [r2 + r3 * 2], xm2, 2 + pextrd [r2 + r4], xm2, 3 + + test r5d, r5d + jz .end + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + vinserti128 m1, m1, xm2, 1 + pshufb m1, m4 
+ pmaddubsw m1, m0 + pmaddwd m1, m5 + packssdw m1, m1 + psubw m1, m6 + vextracti128 xm2, m1, 1 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + movd [r2 + r3 * 2], xm2 +.end + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_6x16, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [tab_Tm] + mova m2, [pw_1] + mova m6, [pw_512] + lea r4, [r1 * 3] + lea r5, [r3 * 3] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 +%rep 4 + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 2 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + ; Row 3 + vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vextracti128 xm4, m3, 1 + movd [r2], xm3 + pextrw [r2 + 4], xm4, 0 + pextrd [r2 + r3], xm3, 1 + pextrw [r2 + r3 + 4], xm4, 2 + pextrd [r2 + r3 * 2], xm3, 2 + pextrw [r2 + r3 * 2 + 4], xm4, 4 + pextrd [r2 + r5], xm3, 3 + pextrw [r2 + r5 + 4], xm4, 6 + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] +%endrep + RET
x265_2.7.tar.gz/source/common/x86/loopfilter.asm -> x265_2.6.tar.gz/source/common/x86/loopfilter.asm
Changed
@@ -374,7 +374,7 @@ pxor m0, m0 ; m0 = 0 mova m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] shr r4d, 4 -.loop: +.loop movu m7, [r0] movu m5, [r0 + 16] movu m3, [r0 + r3] @@ -430,7 +430,7 @@ mova m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] mova m7, [pb_128] shr r4d, 4 -.loop: +.loop movu m1, [r0] ; m1 = pRec[x] movu m2, [r0 + r3] ; m2 = pRec[x + iStride] @@ -478,7 +478,7 @@ mova m4, [pb_2] shr r4d, 4 mova m0, [pw_pixel_max] -.loop: +.loop movu m5, [r0] movu m3, [r0 + r3] @@ -523,7 +523,7 @@ mova xm6, [pb_2] ; xm6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] mova xm7, [pb_128] shr r4d, 4 -.loop: +.loop movu xm1, [r0] ; xm1 = pRec[x] movu xm2, [r0 + r3] ; xm2 = pRec[x + iStride] @@ -572,7 +572,7 @@ mov r5d, r4d shr r4d, 4 mov r6, r0 -.loop: +.loop movu m7, [r0] movu m5, [r0 + 16] movu m3, [r0 + r3] @@ -674,7 +674,7 @@ pxor m0, m0 ; m0 = 0 mova m7, [pb_128] shr r4d, 4 -.loop: +.loop movu m1, [r0] ; m1 = pRec[x] movu m2, [r0 + r3] ; m2 = pRec[x + iStride] @@ -748,7 +748,7 @@ mova m4, [pw_pixel_max] vbroadcasti128 m6, [r2] ; m6 = m_iOffsetEo shr r4d, 4 -.loop: +.loop movu m7, [r0] movu m5, [r0 + r3] movu m1, [r0 + r3 * 2] @@ -804,7 +804,7 @@ vbroadcasti128 m5, [pb_128] vbroadcasti128 m6, [r2] ; m6 = m_iOffsetEo shr r4d, 4 -.loop: +.loop movu xm1, [r0] ; m1 = pRec[x] movu xm2, [r0 + r3] ; m2 = pRec[x + iStride] vinserti128 m1, m1, xm2, 1 @@ -859,7 +859,7 @@ movh m6, [r0 + r4 * 2] movhps m6, [r1 + r4] -.loop: +.loop movu m7, [r0] movu m5, [r0 + 16] movu m3, [r0 + r5 + 2] @@ -918,7 +918,7 @@ movh m5, [r0 + r4] movhps m5, [r1 + r4] -.loop: +.loop movu m1, [r0] ; m1 = rec[x] movu m2, [r0 + r5 + 1] ; m2 = rec[x + stride + 1] pxor m3, m1, m7 @@ -970,7 +970,7 @@ movhps xm4, [r1 + r4] vbroadcasti128 m5, [r3] mova m6, [pw_pixel_max] -.loop: +.loop movu m1, [r0] movu m3, [r0 + r5 + 2] @@ -1061,7 +1061,7 @@ movhps xm4, [r1 + r4] vbroadcasti128 m5, [r3] -.loop: +.loop movu m1, [r0] movu m7, [r0 + 32] movu m3, [r0 + r5 + 2] @@ -1567,11 
+1567,11 @@ movu m4, [r1 + 16] ; offset[16-31] pxor m7, m7 -.loopH: +.loopH mov r5d, r2d xor r6, r6 -.loopW: +.loopW movu m2, [r0 + r6] movu m5, [r0 + r6 + 16] psrlw m0, m2, (BIT_DEPTH - 5) @@ -1617,11 +1617,11 @@ movu m3, [r1 + 0] ; offset[0-15] movu m4, [r1 + 16] ; offset[16-31] pxor m7, m7 ; m7 =[0] -.loopH: +.loopH mov r5d, r2d xor r6, r6 -.loopW: +.loopW movu m2, [r0 + r6] ; m0 = [rec] psrlw m1, m2, 3 pand m1, [pb_31] ; m1 = [index] @@ -1670,9 +1670,9 @@ mov r6d, r3d shr r3d, 1 -.loopH: +.loopH mov r5d, r2d -.loopW: +.loopW movu m2, [r0] movu m5, [r0 + r4] psrlw m0, m2, (BIT_DEPTH - 5) @@ -1751,9 +1751,9 @@ shr r2d, 4 mov r1d, r3d shr r3d, 1 -.loopH: +.loopH mov r5d, r2d -.loopW: +.loopW movu xm2, [r0] ; m2 = [rec] vinserti128 m2, m2, [r0 + r4], 1 psrlw m1, m2, 3 @@ -1789,7 +1789,7 @@ test r1b, 1 jz .end mov r5d, r2d -.loopW1: +.loopW1 movu xm2, [r0] ; m2 = [rec] psrlw xm1, xm2, 3 pand xm1, xm7 ; m1 = [index] @@ -1811,7 +1811,7 @@ add r0, 16 dec r5d jnz .loopW1 -.end: +.end RET %endif @@ -1827,7 +1827,7 @@ add r3d, 1 mov r5, r0 movu m4, [r0 + r4] -.loop: +.loop movu m1, [r1] ; m2 = pRec[x] movu m2, [r2] ; m3 = pTmpU[x] @@ -1921,7 +1921,7 @@ mov r5, r0 movu m4, [r0 + r4] -.loop: +.loop movu m1, [r1] ; m2 = pRec[x] movu m2, [r2] ; m3 = pTmpU[x]
x265_2.7.tar.gz/source/common/x86/mc-a.asm -> x265_2.6.tar.gz/source/common/x86/mc-a.asm
Changed
@@ -4115,7 +4115,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 4 -.loop: +.loop pixel_avg_W8 dec r9d jnz .loop @@ -4129,7 +4129,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 8 -.loop: +.loop pixel_avg_W8 dec r9d jnz .loop @@ -4697,7 +4697,7 @@ lea r8, [r1 * 3] mov r9d, 4 -.loop: +.loop movu m0, [r2] movu m1, [r4] pavgw m0, m1 @@ -4834,7 +4834,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 4 -.loop: +.loop pixel_avg_H16 dec r9d jnz .loop @@ -4848,7 +4848,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 4 -.loop: +.loop pixel_avg_H16 pixel_avg_H16 dec r9d @@ -4863,7 +4863,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 4 -.loop: +.loop pixel_avg_H16 pixel_avg_H16 pixel_avg_H16 @@ -4887,7 +4887,7 @@ lea r8, [r1 * 3] mov r9d, 8 -.loop: +.loop movu m0, [r2] movu m1, [r4] pavgw m0, m1 @@ -4987,7 +4987,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 2 -.loop: +.loop pixel_avg_W32 dec r9d jnz .loop @@ -5001,7 +5001,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 4 -.loop: +.loop pixel_avg_W32 dec r9d jnz .loop @@ -5015,7 +5015,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 6 -.loop: +.loop pixel_avg_W32 dec r9d jnz .loop @@ -5029,7 +5029,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 8 -.loop: +.loop pixel_avg_W32 dec r9d jnz .loop @@ -5043,7 +5043,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 16 -.loop: +.loop pixel_avg_W32 dec r9d jnz .loop @@ -5141,7 +5141,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 4 -.loop: +.loop pixel_avg_W64 dec r9d jnz .loop @@ -5155,7 +5155,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 8 -.loop: +.loop pixel_avg_W64 dec r9d jnz .loop @@ -5169,7 +5169,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 12 -.loop: +.loop pixel_avg_W64 dec r9d jnz .loop @@ -5183,7 +5183,7 @@ lea r7, [r5 * 3] lea r8, [r1 * 3] mov r9d, 16 -.loop: +.loop pixel_avg_W64 dec r9d jnz .loop @@ -5204,7 +5204,7 @@ lea r8, [r1 * 3] mov r9d, 16 -.loop: +.loop movu m0, [r2] movu m1, [r4] pavgw m0, m1
x265_2.7.tar.gz/source/common/x86/pixel-util8.asm -> x265_2.6.tar.gz/source/common/x86/pixel-util8.asm
Changed
@@ -1785,7 +1785,7 @@
     movu [r1], xm7
     je .nextH
-.width6:
+.width6
     cmp r6d, 6
     jl .width4
     movq [r1], xm7
@@ -4937,7 +4937,7 @@
     lea r9, [r4 * 3]
     lea r8, [r5 * 3]
-.loop:
+.loop
     pmovzxbw m0, [r2]
     pmovzxbw m1, [r3]
     pmovzxbw m2, [r2 + r4]
@@ -5150,7 +5150,7 @@
     lea r7, [r4 * 3]
     lea r8, [r5 * 3]
-.loop:
+.loop
     movu m0, [r2]
     movu m1, [r2 + 32]
     movu m2, [r3]
@@ -5557,7 +5557,7 @@
     lea r7, [r4 * 3]
     lea r8, [r5 * 3]
-.loop:
+.loop
     movu m0, [r2]
     movu m1, [r2 + 32]
     movu m2, [r2 + 64]
x265_2.7.tar.gz/source/common/x86/sad-a.asm -> x265_2.6.tar.gz/source/common/x86/sad-a.asm
Changed
@@ -5631,7 +5631,7 @@ xorps m5, m5 mov r4d, 4 -.loop: +.loop movu m1, [r0] ; row 0 of pix0 movu m2, [r2] ; row 0 of pix1 movu m3, [r0 + r1] ; row 1 of pix0 @@ -5676,7 +5676,7 @@ mov r4d, 6 lea r5, [r1 * 3] lea r6, [r3 * 3] -.loop: +.loop movu m1, [r0] ; row 0 of pix0 movu m2, [r2] ; row 0 of pix1 movu m3, [r0 + r1] ; row 1 of pix0 @@ -5718,7 +5718,7 @@ lea r5, [r1 * 3] lea r6, [r3 * 3] -.loop: +.loop movu m1, [r0] ; row 0 of pix0 movu m2, [r2] ; row 0 of pix1 movu m3, [r0 + r1] ; row 1 of pix0 @@ -5759,7 +5759,7 @@ lea r5, [r1 * 3] lea r6, [r3 * 3] -.loop: +.loop movu m1, [r0] ; row 0 of pix0 movu m2, [r2] ; row 0 of pix1 movu m3, [r0 + r1] ; row 1 of pix0 @@ -5822,7 +5822,7 @@ mov r4d, 64/4 lea r5, [r1 * 3] lea r6, [r3 * 3] -.loop: +.loop movu m1, [r0] ; row 0 of pix0 movu m2, [r2] ; row 0 of pix1 movu m3, [r0 + r1] ; row 1 of pix0 @@ -5873,7 +5873,7 @@ xorps m0, m0 xorps m5, m5 mov r4d, 4 -.loop: +.loop movu m1, [r0] ; first 32 of row 0 of pix0 movu m2, [r2] ; first 32 of row 0 of pix1 movu m3, [r0 + 32] ; second 32 of row 0 of pix0 @@ -5936,7 +5936,7 @@ xorps m0, m0 xorps m5, m5 mov r4d, 16 -.loop: +.loop movu m1, [r0] ; first 32 of row 0 of pix0 movu m2, [r2] ; first 32 of row 0 of pix1 movu m3, [r0 + 32] ; second 32 of row 0 of pix0 @@ -5978,7 +5978,7 @@ mov r4d, 12 lea r5, [r1 * 3] lea r6, [r3 * 3] -.loop: +.loop movu m1, [r0] ; first 32 of row 0 of pix0 movu m2, [r2] ; first 32 of row 0 of pix1 movu m3, [r0 + 32] ; second 32 of row 0 of pix0 @@ -6040,7 +6040,7 @@ mov r4d, 8 lea r5, [r1 * 3] lea r6, [r3 * 3] -.loop: +.loop movu m1, [r0] ; first 32 of row 0 of pix0 movu m2, [r2] ; first 32 of row 0 of pix1 movu m3, [r0 + 32] ; second 32 of row 0 of pix0
x265_2.7.tar.gz/source/common/x86/seaintegral.asm -> x265_2.6.tar.gz/source/common/x86/seaintegral.asm
Changed
@@ -36,7 +36,7 @@
     mov r2, r1
     shl r2, 4
-.loop:
+.loop
     movu m0, [r0]
     movu m1, [r0 + r2]
     psubd m1, m0
@@ -54,7 +54,7 @@
     mov r2, r1
     shl r2, 5
-.loop:
+.loop
     movu m0, [r0]
     movu m1, [r0 + r2]
     psubd m1, m0
@@ -75,7 +75,7 @@
     shl r3, 4
     add r2, r3
-.loop:
+.loop
     movu m0, [r0]
     movu m1, [r0 + r2]
     psubd m1, m0
@@ -93,7 +93,7 @@
     mov r2, r1
     shl r2, 6
-.loop:
+.loop
     movu m0, [r0]
     movu m1, [r0 + r2]
     psubd m1, m0
@@ -114,7 +114,7 @@
     shl r3, 5
     add r2, r3
-.loop:
+.loop
     movu m0, [r0]
     movu m1, [r0 + r2]
     psubd m1, m0
@@ -132,7 +132,7 @@
     mov r2, r1
     shl r2, 7
-.loop:
+.loop
     movu m0, [r0]
     movu m1, [r0 + r2]
     psubd m1, m0
@@ -264,7 +264,7 @@
     movu [r0 + r3], xm0
     jmp .end
-.end:
+.end
     RET
 %endif
@@ -379,7 +379,7 @@
     movu [r0 + r3], m0
     jmp .end
-.end:
+.end
     RET
 %endif
@@ -577,7 +577,7 @@
     movu [r0 + r3], xm0
     jmp .end
-.end:
+.end
     RET
 %endif
@@ -740,7 +740,7 @@
     movu [r0 + r3], m0
     jmp .end
-.end:
+.end
     RET
 %endif
@@ -883,7 +883,7 @@
     movu [r0 + r3], m0
     jmp .end
-.end:
+.end
     RET
 %macro INTEGRAL_THIRTYTWO_HORIZONTAL_16 0
@@ -1058,5 +1058,5 @@
     movu [r0 + r3], m0
     jmp .end
-.end:
+.end
     RET
x265_2.7.tar.gz/source/common/x86/x86inc.asm -> x265_2.6.tar.gz/source/common/x86/x86inc.asm
Changed
@@ -66,15 +66,6 @@
     %endif
 %endif
-%define FORMAT_ELF 0
-%ifidn __OUTPUT_FORMAT__,elf
-    %define FORMAT_ELF 1
-%elifidn __OUTPUT_FORMAT__,elf32
-    %define FORMAT_ELF 1
-%elifidn __OUTPUT_FORMAT__,elf64
-    %define FORMAT_ELF 1
-%endif
-
 %ifdef PREFIX
     %define mangle(x) _ %+ x
 %else
@@ -97,10 +88,6 @@
     default rel
 %endif
-%ifdef __NASM_VER__
-    %use smartalign
-%endif
-
 ; Macros to eliminate most code duplication between x86_32 and x86_64:
 ; Currently this works only for leaf functions which load all their arguments
 ; into registers at the start, and make no other use of the stack. Luckily that
@@ -698,7 +685,7 @@
         CAT_XDEFINE cglobaled_, %2, 1
     %endif
     %xdefine current_function %2
-    %if FORMAT_ELF
+    %ifidn __OUTPUT_FORMAT__,elf
         global %2:function %%VISIBILITY
     %else
         global %2
@@ -724,16 +711,14 @@
 ; like cextern, but without the prefix
 %macro cextern_naked 1
-    %ifdef PREFIX
-        %xdefine %1 mangle(%1)
-    %endif
+    %xdefine %1 mangle(%1)
     CAT_XDEFINE cglobaled_, %1, 1
     extern %1
 %endmacro
 %macro const 1-2+
     %xdefine %1 mangle(private_prefix %+ _ %+ %1)
-    %if FORMAT_ELF
+    %ifidn __OUTPUT_FORMAT__,elf
         global %1:data hidden
     %else
         global %1
@@ -742,8 +727,9 @@
     %1: %2
 %endmacro
-; This is needed for ELF, otherwise the GNU linker assumes the stack is executable by default.
-%if FORMAT_ELF
+; This is needed for ELF, otherwise the GNU linker assumes the stack is
+; executable by default.
+%ifidn __OUTPUT_FORMAT__,elf
 [SECTION .note.GNU-stack noalloc noexec nowrite progbits]
 %endif
@@ -815,17 +801,9 @@
     %endif
     %if ARCH_X86_64 || cpuflag(sse2)
-        %ifdef __NASM_VER__
-            ALIGNMODE p6
-        %else
-            CPU amdnop
-        %endif
+        CPU amdnop
     %else
-        %ifdef __NASM_VER__
-            ALIGNMODE nop
-        %else
-            CPU basicnop
-        %endif
+        CPU basicnop
     %endif
 %endmacro
@@ -1489,7 +1467,7 @@
         v%5%6 %1, %2, %3, %4
     %elifidn %1, %2
         ; If %3 or %4 is a memory operand it needs to be encoded as the last operand.
-        %ifnum sizeof%3
+        %ifid %3
             v%{5}213%6 %2, %3, %4
         %else
             v%{5}132%6 %2, %4, %3
@@ -1513,3 +1491,14 @@
 FMA4_INSTR fmsubadd, pd, ps
 FMA4_INSTR fnmadd, pd, ps, sd, ss
 FMA4_INSTR fnmsub, pd, ps, sd, ss
+
+; workaround: vpbroadcastq is broken in x86_32 due to a yasm bug (fixed in 1.3.0)
+%if __YASM_VERSION_ID__ < 0x01030000 && ARCH_X86_64 == 0
+    %macro vpbroadcastq 2
+        %if sizeof%1 == 16
+            movddup %1, %2
+        %else
+            vbroadcastsd %1, %2
+        %endif
+    %endmacro
+%endif
x265_2.7.tar.gz/source/dynamicHDR10/JsonHelper.cpp -> x265_2.6.tar.gz/source/dynamicHDR10/JsonHelper.cpp
Changed
@@ -139,13 +139,21 @@
         return JsonObject();
     }
-    std::ifstream ifs(path);
-    const std::string json_str2((std::istreambuf_iterator<char>(ifs)),
-                                (std::istreambuf_iterator<char>()));
-
+    ifstream tfile;
+    string json_str;
+    string json_str2;
     string err = "";
+    tfile.open(path);
+    while(tfile)
+    {
+        std::getline(tfile, json_str);
+        json_str2.append(json_str);
+    }
-    return Json::parse(json_str2,err, JsonParse::COMMENTS).object_items();
+    tfile.close();
+    size_t beginning = json_str2.find_first_of("{");
+    int fixchar = json_str2[json_str2.size() - 2] == '}' ? 1 : 0;
+    return Json::parse(json_str2.substr(beginning,json_str2.size() - fixchar),err).object_items();
 }
 JsonArray JsonHelper::readJsonArray(const string &path)
@@ -166,13 +174,28 @@
         return JsonArray();
     }
-    std::ifstream ifs(path);
-    const std::string json_str2((std::istreambuf_iterator<char>(ifs)),
-                                (std::istreambuf_iterator<char>()));
-
+    ifstream tfile;
+    string json_str;
+    string json_str2;
     string err = "";
+    tfile.open(path);
+    while(tfile)
+    {
+        std::getline(tfile, json_str);
+        json_str2.append(json_str);
+    }
+
+    tfile.close();
-    return Json::parse(json_str2,err, JsonParse::COMMENTS).array_items();
+    vector<Json> data;
+    if (json_str2.size() != 0)
+    {
+        size_t beginning = json_str2.find_first_of("[");
+        int fixchar = json_str2[json_str2.size() - 2] == ']' ? 1 : 0;
+        return Json::parse(json_str2.substr(beginning, json_str2.size() - fixchar), err).array_items();
+    }
+    else
+        return data;
 }
 bool JsonHelper::validatePathExtension(string &path)
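The two sides of the JsonHelper.cpp hunks read the JSON file in different ways: the 2.7 side builds the string from a pair of `std::istreambuf_iterator`s, while the 2.6 side appends the result of repeated `std::getline` calls. The sketch below shows both idioms in isolation. It is only an illustration: `slurpFile` and `joinLines` are hypothetical names that do not exist in x265, and the loop here tests the return value of `std::getline` directly rather than using the `while(tfile)` form from the diff.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not part of x265): read the whole file into one
// string, keeping every byte, including newlines.
std::string slurpFile(const std::string& path)
{
    std::ifstream ifs(path, std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(ifs)),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper (not part of x265): read line by line and concatenate.
// The loop condition is the std::getline call itself, so nothing is appended
// after a failed read.
std::string joinLines(const std::string& path)
{
    std::ifstream ifs(path);
    std::string line;
    std::string all;
    while (std::getline(ifs, line))
        all.append(line); // std::getline drops the '\n' delimiter
    return all;
}
```

Because `std::getline` discards the delimiter, the joined string contains no newlines, whereas the iterator form preserves the file contents unchanged.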
x265_2.7.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.cpp -> x265_2.6.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.cpp
Changed
@@ -28,7 +28,6 @@
 const std::string JsonDataKeys::LocalParameters = std::string("LocalParameters");
 const std::string JsonDataKeys::TargetDisplayLuminance = std::string("TargetedSystemDisplayMaximumLuminance");
-const std::string JsonDataKeys::NumberOfWindows = std::string("NumberOfWindows");
 const std::string BezierCurveNames::TagName = std::string("BezierCurveData");
 const std::string BezierCurveNames::NumberOfAnchors = std::string("NumberOfAnchors");
x265_2.7.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.h -> x265_2.6.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.h
Changed
@@ -37,7 +37,6 @@
 public:
     static const std::string LocalParameters;
     static const std::string TargetDisplayLuminance;
-    static const std::string NumberOfWindows;
 };
 //Bezier Curve Data
x265_2.7.tar.gz/source/dynamicHDR10/metadataFromJson.cpp -> x265_2.6.tar.gz/source/dynamicHDR10/metadataFromJson.cpp
Changed
@@ -372,7 +372,7 @@ const uint16_t terminalProviderCode = 0x003C; const uint16_t terminalProviderOrientedCode = 0x0001; const uint8_t applicationIdentifier = 4; - const uint8_t applicationVersion = 1; + const uint8_t applicationVersion = 0; mPimpl->appendBits(metadata, countryCode, 8); mPimpl->appendBits(metadata, terminalProviderCode, 16); @@ -384,7 +384,9 @@ //Note: Validated only add up to two local selections, ignore the rest JsonArray jsonArray = fileData[frame][JsonDataKeys::LocalParameters].array_items(); int ellipsesNum = static_cast<int>(jsonArray.size() > 2 ? 2 : jsonArray.size()); - uint16_t numWindows = (uint16_t)fileData[frame][JsonDataKeys::NumberOfWindows].int_value(); + + uint16_t numWindows = 1 + static_cast<uint16_t>(ellipsesNum); + mPimpl->appendBits(metadata, numWindows, 2); for (int i = 0; i < ellipsesNum; ++i) { @@ -424,15 +426,16 @@ mPimpl->appendBits(metadata, semimajorExternalAxis, 16); mPimpl->appendBits(metadata, semiminorExternalAxis, 16); - uint8_t overlapProcessOption = static_cast<uint8_t>(ellipseJsonObject[EllipseNames::OverlapProcessOption].int_value()); + /*bool*/ uint8_t overlapProcessOption = static_cast<uint8_t>(ellipseJsonObject[EllipseNames::OverlapProcessOption].int_value()); //1; //TODO: Uses Layering method, the value is "1" mPimpl->appendBits(metadata, overlapProcessOption, 1); } /* Targeted System Display Data */ - uint32_t monitorPeak = fileData[frame][JsonDataKeys::TargetDisplayLuminance].int_value(); //500; - mPimpl->appendBits(metadata, monitorPeak, 27); + uint32_t TEMPmonitorPeak = fileData[frame][JsonDataKeys::TargetDisplayLuminance].int_value(); //500; + mPimpl->appendBits(metadata, TEMPmonitorPeak, 27); + //NOTE: Set as false for now, as requested - uint8_t targetedSystemDisplayActualPeakLuminanceFlag = 0; + /*bool*/uint8_t targetedSystemDisplayActualPeakLuminanceFlag = 0; /*false*/ mPimpl->appendBits(metadata, targetedSystemDisplayActualPeakLuminanceFlag, 1); if (targetedSystemDisplayActualPeakLuminanceFlag) { @@ 
-460,6 +463,7 @@ mPimpl->appendBits(metadata, static_cast<uint16_t>((int)luminanceData.maxGLuminance & 0xFFFF), 16); mPimpl->appendBits(metadata, static_cast<uint8_t>(((int)luminanceData.maxBLuminance & 0x10000) >> 16), 1); mPimpl->appendBits(metadata, static_cast<uint16_t>((int)luminanceData.maxBLuminance & 0xFFFF), 16); + /* changed from maxRGBLuminance to average luminance to match stms implementation */ mPimpl->appendBits(metadata, static_cast<uint8_t>(((int)luminanceData.averageLuminance & 0x10000) >> 16), 1); mPimpl->appendBits(metadata, static_cast<uint16_t>((int)luminanceData.averageLuminance & 0xFFFF), 16); @@ -474,7 +478,7 @@ uint8_t distributionMaxrgbPercentage = static_cast<uint8_t>(percentilPercentages.at(i)); mPimpl->appendBits(metadata, distributionMaxrgbPercentage, 7); - /* 17bits: 1bit then 16 */ + // 17bits: 1bit then 16 unsigned int ithPercentile = luminanceData.percentiles.at(i); uint8_t highValue = static_cast<uint8_t>((ithPercentile & 0x10000) >> 16); uint16_t lowValue = static_cast<uint16_t>(ithPercentile & 0xFFFF); @@ -495,32 +499,33 @@ { //TODO } - /* Bezier Curve Data */ + // BEZIER CURVE DATA for (int w = 0; w < numWindows; ++w) { + //TODO: uint8_t toneMappingFlag = 1; - /* Check if the window contains tone mapping bezier curve data and set toneMappingFlag appropriately */ - //Json bezierData = fileData[frame][BezierCurveNames::TagName]; - BezierCurveData curveData; - /* Select curve data based on global window */ - if (w == 0) + mPimpl->appendBits(metadata, toneMappingFlag, 1); + if (toneMappingFlag) { - if (!mPimpl->bezierCurveFromJson(fileData[frame][BezierCurveNames::TagName], curveData)) + Json bezierData = fileData[frame][BezierCurveNames::TagName]; + BezierCurveData curveData; + + /* Select curve data based on global window or local window */ + if (w == 0) { - toneMappingFlag = 0; + if (!mPimpl->bezierCurveFromJson(bezierData, curveData)) + { + std::cout << "error parsing bezierCurve frame: " << w << std::endl; + } } - } - /* 
Select curve data based on local window */ - else - { - if (!mPimpl->bezierCurveFromJson(jsonArray[w - 1][BezierCurveNames::TagName], curveData)) + else { - toneMappingFlag = 0; + if (!mPimpl->bezierCurveFromJson(jsonArray[w - 1][BezierCurveNames::TagName], curveData)) + { + std::cout << "error parsing bezierCurve ellipse: " << w - 1 << std::endl; + } } - } - mPimpl->appendBits(metadata, toneMappingFlag, 1); - if (toneMappingFlag) - { + uint16_t kneePointX = static_cast<uint16_t>(curveData.sPx); mPimpl->appendBits(metadata, kneePointX, 12); uint16_t kneePointY = static_cast<uint16_t>(curveData.sPy); @@ -536,7 +541,7 @@ mPimpl->appendBits(metadata, anchor, 10); } } - } + } /* Set to false as requested */ bool colorSaturationMappingFlag = 0; mPimpl->appendBits(metadata, colorSaturationMappingFlag, 1);
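The luminance and percentile fields in the hunks above are written as 17-bit quantities: one high bit, then the low 16 bits. A minimal sketch of that split, assuming an MSB-first writer. `BitSink` and `append17` are hypothetical stand-ins for illustration; the real `mPimpl->appendBits` implementation is not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical MSB-first bit sink. It only stands in for mPimpl->appendBits,
// whose actual implementation is not part of this diff.
struct BitSink
{
    std::vector<uint8_t> bytes;
    int bitsUsed = 0; // bits already written into the last byte (0..7)

    void appendBits(uint32_t value, int count)
    {
        for (int i = count - 1; i >= 0; --i)
        {
            if (bitsUsed == 0)
                bytes.push_back(0); // start a fresh byte
            uint8_t bit = static_cast<uint8_t>((value >> i) & 1u);
            bytes.back() = static_cast<uint8_t>(bytes.back() | (bit << (7 - bitsUsed)));
            bitsUsed = (bitsUsed + 1) % 8;
        }
    }
};

// Write a 17-bit value as 1 high bit followed by 16 low bits, mirroring the
// "(v & 0x10000) >> 16" and "v & 0xFFFF" pattern used in the hunk.
inline void append17(BitSink& sink, uint32_t v)
{
    uint8_t highValue = static_cast<uint8_t>((v & 0x10000) >> 16);
    uint16_t lowValue = static_cast<uint16_t>(v & 0xFFFF);
    sink.appendBits(highValue, 1);
    sink.appendBits(lowValue, 16);
}
```

Calling `append17(sink, 0x1ABCD)` on an empty sink leaves three bytes, with the last one holding a single used bit.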
x265_2.7.tar.gz/source/encoder/analysis.cpp -> x265_2.6.tar.gz/source/encoder/analysis.cpp
Changed
@@ -100,17 +100,16 @@ for (uint32_t depth = 0; depth <= m_param->maxCUDepth; depth++, cuSize >>= 1) { ModeDepth &md = m_modeDepth[depth]; - ok &= md.cuMemPool.create(depth, csp, MAX_PRED_TYPES, *m_param); + + md.cuMemPool.create(depth, csp, MAX_PRED_TYPES, *m_param); ok &= md.fencYuv.create(cuSize, csp); - if (ok) + + for (int j = 0; j < MAX_PRED_TYPES; j++) { - for (int j = 0; j < MAX_PRED_TYPES; j++) - { - md.pred[j].cu.initialize(md.cuMemPool, depth, *m_param, j); - ok &= md.pred[j].predYuv.create(cuSize, csp); - ok &= md.pred[j].reconYuv.create(cuSize, csp); - md.pred[j].fencYuv = &md.fencYuv; - } + md.pred[j].cu.initialize(md.cuMemPool, depth, *m_param, j); + ok &= md.pred[j].predYuv.create(cuSize, csp); + ok &= md.pred[j].reconYuv.create(cuSize, csp); + md.pred[j].fencYuv = &md.fencYuv; } } if (m_param->sourceHeight >= 1080) @@ -159,34 +158,38 @@ if (m_param->bCTUInfo && (*m_frame->m_ctuInfo + ctu.m_cuAddr)) { x265_ctu_info_t* ctuTemp = *m_frame->m_ctuInfo + ctu.m_cuAddr; - int32_t depthIdx = 0; - uint32_t maxNum8x8Partitions = 64; - uint8_t* depthInfoPtr = m_frame->m_addOnDepth[ctu.m_cuAddr]; - uint8_t* contentInfoPtr = m_frame->m_addOnCtuInfo[ctu.m_cuAddr]; - int* prevCtuInfoChangePtr = m_frame->m_addOnPrevChange[ctu.m_cuAddr]; - do - { - uint8_t depth = (uint8_t)ctuTemp->ctuPartitions[depthIdx]; - uint8_t content = (uint8_t)(*((int32_t *)ctuTemp->ctuInfo + depthIdx)); - int prevCtuInfoChange = m_frame->m_prevCtuInfoChange[ctu.m_cuAddr * maxNum8x8Partitions + depthIdx]; - memset(depthInfoPtr, depth, sizeof(uint8_t) * numPartition >> 2 * depth); - memset(contentInfoPtr, content, sizeof(uint8_t) * numPartition >> 2 * depth); - memset(prevCtuInfoChangePtr, 0, sizeof(int) * numPartition >> 2 * depth); - for (uint32_t l = 0; l < numPartition >> 2 * depth; l++) - prevCtuInfoChangePtr[l] = prevCtuInfoChange; - depthInfoPtr += ctu.m_numPartitions >> 2 * depth; - contentInfoPtr += ctu.m_numPartitions >> 2 * depth; - prevCtuInfoChangePtr += ctu.m_numPartitions >> 2 * 
depth; - depthIdx++; - } while (ctuTemp->ctuPartitions[depthIdx] != 0); - - m_additionalCtuInfo = m_frame->m_addOnCtuInfo[ctu.m_cuAddr]; - m_prevCtuInfoChange = m_frame->m_addOnPrevChange[ctu.m_cuAddr]; - memcpy(ctu.m_cuDepth, m_frame->m_addOnDepth[ctu.m_cuAddr], sizeof(uint8_t) * numPartition); - //Calculate log2CUSize from depth - for (uint32_t i = 0; i < cuGeom.numPartitions; i++) - ctu.m_log2CUSize[i] = (uint8_t)m_param->maxLog2CUSize - ctu.m_cuDepth[i]; + if (ctuTemp->ctuPartitions) + { + int32_t depthIdx = 0; + uint32_t maxNum8x8Partitions = 64; + uint8_t* depthInfoPtr = m_frame->m_addOnDepth[ctu.m_cuAddr]; + uint8_t* contentInfoPtr = m_frame->m_addOnCtuInfo[ctu.m_cuAddr]; + int* prevCtuInfoChangePtr = m_frame->m_addOnPrevChange[ctu.m_cuAddr]; + do + { + uint8_t depth = (uint8_t)ctuTemp->ctuPartitions[depthIdx]; + uint8_t content = (uint8_t)(*((int32_t *)ctuTemp->ctuInfo + depthIdx)); + int prevCtuInfoChange = m_frame->m_prevCtuInfoChange[ctu.m_cuAddr * maxNum8x8Partitions + depthIdx]; + memset(depthInfoPtr, depth, sizeof(uint8_t) * numPartition >> 2 * depth); + memset(contentInfoPtr, content, sizeof(uint8_t) * numPartition >> 2 * depth); + memset(prevCtuInfoChangePtr, 0, sizeof(int) * numPartition >> 2 * depth); + for (uint32_t l = 0; l < numPartition >> 2 * depth; l++) + prevCtuInfoChangePtr[l] = prevCtuInfoChange; + depthInfoPtr += ctu.m_numPartitions >> 2 * depth; + contentInfoPtr += ctu.m_numPartitions >> 2 * depth; + prevCtuInfoChangePtr += ctu.m_numPartitions >> 2 * depth; + depthIdx++; + } while (ctuTemp->ctuPartitions[depthIdx] != 0); + + m_additionalCtuInfo = m_frame->m_addOnCtuInfo[ctu.m_cuAddr]; + m_prevCtuInfoChange = m_frame->m_addOnPrevChange[ctu.m_cuAddr]; + memcpy(ctu.m_cuDepth, m_frame->m_addOnDepth[ctu.m_cuAddr], sizeof(uint8_t) * numPartition); + //Calculate log2CUSize from depth + for (uint32_t i = 0; i < cuGeom.numPartitions; i++) + ctu.m_log2CUSize[i] = (uint8_t)m_param->maxLog2CUSize - ctu.m_cuDepth[i]; + } } + if 
(m_param->analysisMultiPassRefine && m_param->rc.bStatRead) { m_multipassAnalysis = (analysis2PassFrameData*)m_frame->m_analysis2Pass.analysisFramedata; @@ -204,11 +207,11 @@ } } - if ((m_param->analysisSave || m_param->analysisLoad) && m_slice->m_sliceType != I_SLICE && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel < 10) + if (m_param->analysisReuseMode && m_slice->m_sliceType != I_SLICE && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel < 10) { int numPredDir = m_slice->isInterP() ? 1 : 2; m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; - m_reuseRef = &m_reuseInterDataCTU->ref [ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; + m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; m_reuseDepth = &m_reuseInterDataCTU->depth[ctu.m_cuAddr * ctu.m_numPartitions]; m_reuseModes = &m_reuseInterDataCTU->modes[ctu.m_cuAddr * ctu.m_numPartitions]; if (m_param->analysisReuseLevel > 4) @@ -216,7 +219,7 @@ m_reusePartSize = &m_reuseInterDataCTU->partSize[ctu.m_cuAddr * ctu.m_numPartitions]; m_reuseMergeFlag = &m_reuseInterDataCTU->mergeFlag[ctu.m_cuAddr * ctu.m_numPartitions]; } - if (m_param->analysisSave && !m_param->analysisLoad) + if (m_param->analysisReuseMode == X265_ANALYSIS_SAVE) for (int i = 0; i < X265_MAX_PRED_MODE_PER_CTU * numPredDir; i++) m_reuseRef[i] = -1; } @@ -225,7 +228,7 @@ if (m_slice->m_sliceType == I_SLICE) { analysis_intra_data* intraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData; - if (m_param->analysisLoad && m_param->analysisReuseLevel > 1) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->analysisReuseLevel > 1) { memcpy(ctu.m_cuDepth, &intraDataCTU->depth[ctu.m_cuAddr * numPartition], sizeof(uint8_t) * numPartition); memcpy(ctu.m_lumaIntraDir, &intraDataCTU->modes[ctu.m_cuAddr * numPartition], sizeof(uint8_t) * numPartition); @@ -236,7 +239,7 @@ } else { - bool bCopyAnalysis = ((m_param->analysisLoad 
&& m_param->analysisReuseLevel == 10) || (m_param->bMVType && m_param->analysisReuseLevel >= 7 && ctu.m_numPartitions <= 16)); + bool bCopyAnalysis = ((m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->analysisReuseLevel == 10) || (m_param->bMVType && m_param->analysisReuseLevel >= 7 && ctu.m_numPartitions <= 16)); bool BCompressInterCUrd0_4 = (m_param->bMVType && m_param->analysisReuseLevel >= 7 && m_param->rdLevel <= 4); bool BCompressInterCUrd5_6 = (m_param->bMVType && m_param->analysisReuseLevel >= 7 && m_param->rdLevel >= 5 && m_param->rdLevel <= 6); bCopyAnalysis = bCopyAnalysis || BCompressInterCUrd0_4 || BCompressInterCUrd5_6; @@ -277,7 +280,7 @@ /* generate residual for entire CTU at once and copy to reconPic */ encodeResidue(ctu, cuGeom); } - else if ((m_param->analysisLoad && m_param->analysisReuseLevel == 10) || ((m_param->bMVType == AVC_INFO) && m_param->analysisReuseLevel >= 7 && ctu.m_numPartitions <= 16)) + else if ((m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->analysisReuseLevel == 10) || ((m_param->bMVType == AVC_INFO) && m_param->analysisReuseLevel >= 7)) { analysis_inter_data* interDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; int posCTU = ctu.m_cuAddr * numPartition; @@ -456,9 +459,11 @@ int bestCUQP = qp; int lambdaQP = lqp; + bool doQPRefine = (bDecidedDepth && depth <= m_slice->m_pps->maxCuDQPDepth) || (!bDecidedDepth && depth == m_slice->m_pps->maxCuDQPDepth); - if (m_param->analysisReuseLevel >= 7) + if (m_param->analysisReuseLevel == 10) doQPRefine = false; + if (doQPRefine) { uint64_t bestCUCost, origCUCost, cuCost, cuPrevCost; @@ -647,12 +652,13 @@ cacheCost[cuIdx] = md.bestMode->rdCost; } - if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4) + /* Save Intra CUs TU depth only when analysis mode is OFF */ + if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4 && !m_param->analysisReuseMode) { CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr); 
int8_t maxTUDepth = -1; for (uint32_t i = 0; i < cuGeom.numPartitions; i++) - maxTUDepth = X265_MAX(maxTUDepth, md.bestMode->cu.m_tuDepth[i]); + maxTUDepth = X265_MAX(maxTUDepth, md.pred[PRED_INTRA].cu.m_tuDepth[i]); ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth; } @@ -1259,7 +1265,7 @@ mightSplit &= !bDecidedDepth; } } - if ((m_param->analysisLoad && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10)) + if ((m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10)) { if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx]) { @@ -1299,8 +1305,9 @@ } } } + /* Step 1. Evaluate Merge/Skip candidates for likely early-outs, if skip mode was not set above */ - if ((mightNotSplit && depth >= minDepth && !md.bestMode && !bCtuInfoCheck) || (m_param->bMVType && m_param->analysisReuseLevel == 7 && (m_modeFlag[0] || m_modeFlag[1]))) /* TODO: Re-evaluate if analysis load/save still works */ + if ((mightNotSplit && depth >= minDepth && !md.bestMode && !bCtuInfoCheck) || (m_param->bMVType && (m_modeFlag[0] || m_modeFlag[1]))) /* TODO: Re-evaluate if analysis load/save still works */ { /* Compute Merge Cost */ md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); @@ -1310,7 +1317,8 @@ skipModes = (m_param->bEnableEarlySkip || m_param->interRefine == 2) && md.bestMode && md.bestMode->cu.isSkipped(0); // TODO: sa8d threshold per depth } - if (md.bestMode && m_param->bEnableRecursionSkip && !bCtuInfoCheck && !(m_param->bMVType && m_param->analysisReuseLevel == 7 && (m_modeFlag[0] || m_modeFlag[1]))) + + if (md.bestMode && m_param->bEnableRecursionSkip && !bCtuInfoCheck && !(m_param->bMVType && (m_modeFlag[0] || m_modeFlag[1]))) { skipRecursion = md.bestMode->cu.isSkipped(0); if (mightSplit && depth >= minDepth && !skipRecursion) @@ -1321,8 +1329,10 @@ skipRecursion = complexityCheckCU(*md.bestMode); } } - if (m_param->bMVType && md.bestMode && cuGeom.numPartitions <= 16 && 
m_param->analysisReuseLevel == 7) + + if (m_param->bMVType && md.bestMode && cuGeom.numPartitions <= 16) skipRecursion = true; + /* Step 2. Evaluate each of the 4 split sub-blocks in series */ if (mightSplit && !skipRecursion) { @@ -1377,20 +1387,11 @@ else splitPred->sa8dCost = m_rdCost.calcRdSADCost((uint32_t)splitPred->distortion, splitPred->sa8dBits); } + /* If analysis mode is simple do not Evaluate other modes */ - if (m_param->bMVType && m_param->analysisReuseLevel == 7) - { - if (m_slice->m_sliceType == P_SLICE) - { - if (m_checkMergeAndSkipOnly[0]) - skipModes = true; - } - else - { - if (m_checkMergeAndSkipOnly[0] && m_checkMergeAndSkipOnly[1]) - skipModes = true; - } - } + if ((m_param->bMVType && cuGeom.numPartitions <= 16) && (m_slice->m_sliceType == P_SLICE || m_slice->m_sliceType == B_SLICE)) + mightNotSplit = !(m_checkMergeAndSkipOnly[0] || (m_checkMergeAndSkipOnly[0] && m_checkMergeAndSkipOnly[1])); + /* Split CUs * 0 1 * 2 3 */ @@ -1953,7 +1954,7 @@ mightSplit &= !bDecidedDepth; } } - if (m_param->analysisLoad && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10) { if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx]) { @@ -1997,9 +1998,10 @@ } } } + /* Step 1. 
Evaluate Merge/Skip candidates for likely early-outs */ if ((mightNotSplit && !md.bestMode && !bCtuInfoCheck) || - (m_param->bMVType && m_param->analysisReuseLevel == 7 && (m_modeFlag[0] || m_modeFlag[1]))) + (m_param->bMVType && (m_modeFlag[0] || m_modeFlag[1]))) { md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); @@ -2014,8 +2016,10 @@ if (m_param->bEnableRecursionSkip && depth && m_modeDepth[depth - 1].bestMode) skipRecursion = md.bestMode && !md.bestMode->cu.getQtRootCbf(0); } - if (m_param->bMVType && md.bestMode && cuGeom.numPartitions <= 16 && m_param->analysisReuseLevel == 7) + + if (m_param->bMVType && md.bestMode && cuGeom.numPartitions <= 16) skipRecursion = true; + // estimate split cost /* Step 2. Evaluate each of the 4 split sub-blocks in series */ if (mightSplit && !skipRecursion) @@ -2067,20 +2071,11 @@ checkDQPForSplitPred(*splitPred, cuGeom); } + /* If analysis mode is simple do not Evaluate other modes */ - if (m_param->bMVType && m_param->analysisReuseLevel == 7) - { - if (m_slice->m_sliceType == P_SLICE) - { - if (m_checkMergeAndSkipOnly[0]) - skipModes = true; - } - else - { - if (m_checkMergeAndSkipOnly[0] && m_checkMergeAndSkipOnly[1]) - skipModes = true; - } - } + if ((m_param->bMVType && cuGeom.numPartitions <= 16) && (m_slice->m_sliceType == P_SLICE || m_slice->m_sliceType == B_SLICE)) + mightNotSplit = !(m_checkMergeAndSkipOnly[0] || (m_checkMergeAndSkipOnly[0] && m_checkMergeAndSkipOnly[1])); + /* Split CUs * 0 1 * 2 3 */ @@ -2886,7 +2881,7 @@ interMode.cu.setPredModeSubParts(MODE_INTER); int numPredDir = m_slice->isInterP() ? 
1 : 2; - if (m_param->analysisLoad && m_reuseInterDataCTU && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_reuseInterDataCTU && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10) { int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2; int index = 0; @@ -2928,7 +2923,7 @@ } interMode.sa8dCost = m_rdCost.calcRdSADCost((uint32_t)interMode.distortion, interMode.sa8dBits); - if (m_param->analysisSave && m_reuseInterDataCTU && m_param->analysisReuseLevel > 1) + if (m_param->analysisReuseMode == X265_ANALYSIS_SAVE && m_reuseInterDataCTU && m_param->analysisReuseLevel > 1) { int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2; int index = 0; @@ -2950,7 +2945,7 @@ interMode.cu.setPredModeSubParts(MODE_INTER); int numPredDir = m_slice->isInterP() ? 1 : 2; - if (m_param->analysisLoad && m_reuseInterDataCTU && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_reuseInterDataCTU && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10) { int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2; int index = 0; @@ -2984,7 +2979,7 @@ /* predInterSearch sets interMode.sa8dBits, but this is ignored */ encodeResAndCalcRdInterCU(interMode, cuGeom); - if (m_param->analysisSave && m_reuseInterDataCTU && m_param->analysisReuseLevel > 1) + if (m_param->analysisReuseMode == X265_ANALYSIS_SAVE && m_reuseInterDataCTU && m_param->analysisReuseLevel > 1) { int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2; int index = 0;
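Most of the analysis.cpp hunks swap one predicate style for another: the `-` lines test two independent filename fields (`analysisSave`, `analysisLoad`), while the `+` lines test a single `analysisReuseMode` value against `X265_ANALYSIS_SAVE` or `X265_ANALYSIS_LOAD`. The sketch below contrasts the two shapes. The struct names, helper names, and the numeric enum values are assumptions made for illustration and are not taken from the x265 headers.

```cpp
#include <cassert>

// Local stand-ins; the numeric values are assumed for this sketch only.
enum AnalysisMode { ANALYSIS_OFF = 0, ANALYSIS_SAVE = 1, ANALYSIS_LOAD = 2 };

// "+" side: one tri-state field, so save and load are mutually exclusive.
struct ParamsSingleMode { int analysisReuseMode; };

// "-" side: two optional filenames, so both can be set in the same run.
struct ParamsTwoFiles { const char* analysisSave; const char* analysisLoad; };

// Is any analysis reuse active?
inline bool reuseEnabled(const ParamsSingleMode& p) { return p.analysisReuseMode != ANALYSIS_OFF; }
inline bool reuseEnabled(const ParamsTwoFiles& p)   { return p.analysisSave || p.analysisLoad; }

// Should the reference table be reset to -1 before encoding the CTU?
inline bool resetsRefs(const ParamsSingleMode& p) { return p.analysisReuseMode == ANALYSIS_SAVE; }
inline bool resetsRefs(const ParamsTwoFiles& p)   { return p.analysisSave && !p.analysisLoad; }
```

The two-field form has a state the single-mode form cannot express: saving and loading at once, in which case the reset is skipped.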
x265_2.7.tar.gz/source/encoder/api.cpp -> x265_2.6.tar.gz/source/encoder/api.cpp
Changed
@@ -67,7 +67,9 @@ "Y PSNR, U PSNR, V PSNR, Global PSNR, SSIM, SSIM (dB), " "I count, I ave-QP, I kbps, I-PSNR Y, I-PSNR U, I-PSNR V, I-SSIM (dB), " "P count, P ave-QP, P kbps, P-PSNR Y, P-PSNR U, P-PSNR V, P-SSIM (dB), " - "B count, B ave-QP, B kbps, B-PSNR Y, B-PSNR U, B-PSNR V, B-SSIM (dB), "; + "B count, B ave-QP, B kbps, B-PSNR Y, B-PSNR U, B-PSNR V, B-SSIM (dB), " + "MaxCLL, MaxFALL, Version\n"; + x265_encoder *x265_encoder_open(x265_param *p) { if (!p) @@ -190,10 +192,9 @@ { if (!enc || !param_in) return -1; + x265_param save; Encoder* encoder = static_cast<Encoder*>(enc); - if (encoder->m_param->csvfn == NULL && param_in->csvfpt != NULL) - encoder->m_param->csvfpt = param_in->csvfpt; if (encoder->m_latestParam->forceFlush != param_in->forceFlush) return encoder->reconfigureParam(encoder->m_latestParam, param_in); bool isReconfigureRc = encoder->isReconfigureRc(encoder->m_latestParam, param_in); @@ -310,9 +311,7 @@ Encoder *encoder = static_cast<Encoder*>(enc); x265_stats stats; encoder->fetchStats(&stats, sizeof(stats)); - int padx = encoder->m_sps.conformanceWindow.rightOffset; - int pady = encoder->m_sps.conformanceWindow.bottomOffset; - x265_csvlog_encode(encoder->m_param, &stats, padx, pady, argc, argv); + x265_csvlog_encode(enc, &stats, argc, argv); } } @@ -357,13 +356,13 @@ return -1; } -int x265_get_ref_frame_list(x265_encoder *enc, x265_picyuv** l0, x265_picyuv** l1, int sliceType, int poc, int* pocL0, int* pocL1) +int x265_get_ref_frame_list(x265_encoder *enc, x265_picyuv** l0, x265_picyuv** l1, int sliceType, int poc) { if (!enc) return -1; Encoder *encoder = static_cast<Encoder*>(enc); - return encoder->getRefFrameList((PicYuv**)l0, (PicYuv**)l1, sliceType, poc, pocL0, pocL1); + return encoder->getRefFrameList((PicYuv**)l0, (PicYuv**)l1, sliceType, poc); } int x265_set_analysis_data(x265_encoder *enc, x265_analysis_data *analysis_data, int poc, uint32_t cuBytes) @@ -399,7 +398,7 @@ pic->userSEI.payloads = NULL; pic->userSEI.numPayloads = 0; - if 
((param->analysisSave || param->analysisLoad) || (param->bMVType == AVC_INFO)) + if (param->analysisReuseMode || (param->bMVType == AVC_INFO)) { uint32_t widthInCU = (param->sourceWidth + param->maxCUSize - 1) >> param->maxLog2CUSize; uint32_t heightInCU = (param->sourceHeight + param->maxCUSize - 1) >> param->maxLog2CUSize; @@ -755,12 +754,7 @@ fprintf(csvfp, "\n"); } else - { fputs(summaryCSVHeader, csvfp); - if (param->csvLogLevel >= 2 || param->maxCLL || param->maxFALL) - fputs("MaxCLL, MaxFALL,", csvfp); - fputs(" Version\n", csvfp); - } } return csvfp; } @@ -873,40 +867,45 @@ fflush(stderr); } -void x265_csvlog_encode(const x265_param *p, const x265_stats *stats, int padx, int pady, int argc, char** argv) +void x265_csvlog_encode(x265_encoder *enc, const x265_stats* stats, int argc, char** argv) { - if (p && p->csvfpt) + if (enc) { + Encoder *encoder = static_cast<Encoder*>(enc); + int padx = encoder->m_sps.conformanceWindow.rightOffset; + int pady = encoder->m_sps.conformanceWindow.bottomOffset; const x265_api * api = x265_api_get(0); - if (p->csvLogLevel) + if (!encoder->m_param->csvfpt) + return; + + if (encoder->m_param->csvLogLevel) { // adding summary to a per-frame csv log file, so it needs a summary header - fprintf(p->csvfpt, "\nSummary\n"); - fputs(summaryCSVHeader, p->csvfpt); - if (p->csvLogLevel >= 2 || p->maxCLL || p->maxFALL) - fputs("MaxCLL, MaxFALL,", p->csvfpt); - fputs(" Version\n",p->csvfpt); + fprintf(encoder->m_param->csvfpt, "\nSummary\n"); + fputs(summaryCSVHeader, encoder->m_param->csvfpt); } + // CLI arguments or other if (argc) { - fputc('"', p->csvfpt); + fputc('"', encoder->m_param->csvfpt); for (int i = 1; i < argc; i++) { - fputc(' ', p->csvfpt); - fputs(argv[i], p->csvfpt); + fputc(' ', encoder->m_param->csvfpt); + fputs(argv[i], encoder->m_param->csvfpt); } - fputc('"', p->csvfpt); + fputc('"', encoder->m_param->csvfpt); } else { - char *opts = x265_param2string((x265_param*)p, padx, pady); + const x265_param* paramTemp = 
encoder->m_param; + char *opts = x265_param2string((x265_param*)paramTemp, padx, pady); if (opts) { - fputc('"', p->csvfpt); - fputs(opts, p->csvfpt); - fputc('"', p->csvfpt); + fputc('"', encoder->m_param->csvfpt); + fputs(opts, encoder->m_param->csvfpt); + fputc('"', encoder->m_param->csvfpt); } } @@ -917,70 +916,69 @@ timeinfo = localtime(&now); char buffer[200]; strftime(buffer, 128, "%c", timeinfo); - fprintf(p->csvfpt, ", %s, ", buffer); + fprintf(encoder->m_param->csvfpt, ", %s, ", buffer); // elapsed time, fps, bitrate - fprintf(p->csvfpt, "%.2f, %.2f, %.2f,", + fprintf(encoder->m_param->csvfpt, "%.2f, %.2f, %.2f,", stats->elapsedEncodeTime, stats->encodedPictureCount / stats->elapsedEncodeTime, stats->bitrate); - if (p->bEnablePsnr) - fprintf(p->csvfpt, " %.3lf, %.3lf, %.3lf, %.3lf,", + if (encoder->m_param->bEnablePsnr) + fprintf(encoder->m_param->csvfpt, " %.3lf, %.3lf, %.3lf, %.3lf,", stats->globalPsnrY / stats->encodedPictureCount, stats->globalPsnrU / stats->encodedPictureCount, stats->globalPsnrV / stats->encodedPictureCount, stats->globalPsnr); else - fprintf(p->csvfpt, " -, -, -, -,"); - if (p->bEnableSsim) - fprintf(p->csvfpt, " %.6f, %6.3f,", stats->globalSsim, x265_ssim2dB(stats->globalSsim)); + fprintf(encoder->m_param->csvfpt, " -, -, -, -,"); + if (encoder->m_param->bEnableSsim) + fprintf(encoder->m_param->csvfpt, " %.6f, %6.3f,", stats->globalSsim, x265_ssim2dB(stats->globalSsim)); else - fprintf(p->csvfpt, " -, -,"); + fprintf(encoder->m_param->csvfpt, " -, -,"); if (stats->statsI.numPics) { - fprintf(p->csvfpt, " %-6u, %2.2lf, %-8.2lf,", stats->statsI.numPics, stats->statsI.avgQp, stats->statsI.bitrate); - if (p->bEnablePsnr) - fprintf(p->csvfpt, " %.3lf, %.3lf, %.3lf,", stats->statsI.psnrY, stats->statsI.psnrU, stats->statsI.psnrV); + fprintf(encoder->m_param->csvfpt, " %-6u, %2.2lf, %-8.2lf,", stats->statsI.numPics, stats->statsI.avgQp, stats->statsI.bitrate); + if (encoder->m_param->bEnablePsnr) + fprintf(encoder->m_param->csvfpt, " 
%.3lf, %.3lf, %.3lf,", stats->statsI.psnrY, stats->statsI.psnrU, stats->statsI.psnrV); else - fprintf(p->csvfpt, " -, -, -,"); - if (p->bEnableSsim) - fprintf(p->csvfpt, " %.3lf,", stats->statsI.ssim); + fprintf(encoder->m_param->csvfpt, " -, -, -,"); + if (encoder->m_param->bEnableSsim) + fprintf(encoder->m_param->csvfpt, " %.3lf,", stats->statsI.ssim); else - fprintf(p->csvfpt, " -,"); + fprintf(encoder->m_param->csvfpt, " -,"); } else - fprintf(p->csvfpt, " -, -, -, -, -, -, -,"); + fprintf(encoder->m_param->csvfpt, " -, -, -, -, -, -, -,"); if (stats->statsP.numPics) { - fprintf(p->csvfpt, " %-6u, %2.2lf, %-8.2lf,", stats->statsP.numPics, stats->statsP.avgQp, stats->statsP.bitrate); - if (p->bEnablePsnr) - fprintf(p->csvfpt, " %.3lf, %.3lf, %.3lf,", stats->statsP.psnrY, stats->statsP.psnrU, stats->statsP.psnrV); + fprintf(encoder->m_param->csvfpt, " %-6u, %2.2lf, %-8.2lf,", stats->statsP.numPics, stats->statsP.avgQp, stats->statsP.bitrate); + if (encoder->m_param->bEnablePsnr) + fprintf(encoder->m_param->csvfpt, " %.3lf, %.3lf, %.3lf,", stats->statsP.psnrY, stats->statsP.psnrU, stats->statsP.psnrV); else - fprintf(p->csvfpt, " -, -, -,"); - if (p->bEnableSsim) - fprintf(p->csvfpt, " %.3lf,", stats->statsP.ssim); + fprintf(encoder->m_param->csvfpt, " -, -, -,"); + if (encoder->m_param->bEnableSsim) + fprintf(encoder->m_param->csvfpt, " %.3lf,", stats->statsP.ssim); else - fprintf(p->csvfpt, " -,"); + fprintf(encoder->m_param->csvfpt, " -,"); } else - fprintf(p->csvfpt, " -, -, -, -, -, -, -,"); + fprintf(encoder->m_param->csvfpt, " -, -, -, -, -, -, -,"); if (stats->statsB.numPics) { - fprintf(p->csvfpt, " %-6u, %2.2lf, %-8.2lf,", stats->statsB.numPics, stats->statsB.avgQp, stats->statsB.bitrate); - if (p->bEnablePsnr) - fprintf(p->csvfpt, " %.3lf, %.3lf, %.3lf,", stats->statsB.psnrY, stats->statsB.psnrU, stats->statsB.psnrV); + fprintf(encoder->m_param->csvfpt, " %-6u, %2.2lf, %-8.2lf,", stats->statsB.numPics, stats->statsB.avgQp, stats->statsB.bitrate); + if 
(encoder->m_param->bEnablePsnr) + fprintf(encoder->m_param->csvfpt, " %.3lf, %.3lf, %.3lf,", stats->statsB.psnrY, stats->statsB.psnrU, stats->statsB.psnrV); else - fprintf(p->csvfpt, " -, -, -,"); - if (p->bEnableSsim) - fprintf(p->csvfpt, " %.3lf,", stats->statsB.ssim); + fprintf(encoder->m_param->csvfpt, " -, -, -,"); + if (encoder->m_param->bEnableSsim) + fprintf(encoder->m_param->csvfpt, " %.3lf,", stats->statsB.ssim); else - fprintf(p->csvfpt, " -,"); + fprintf(encoder->m_param->csvfpt, " -,"); } else - fprintf(p->csvfpt, " -, -, -, -, -, -, -,"); - if (p->csvLogLevel >= 2 || p->maxCLL || p->maxFALL) - fprintf(p->csvfpt, " %-6u, %-6u,", stats->maxCLL, stats->maxFALL); - fprintf(p->csvfpt, " %s\n", api->version_str); + fprintf(encoder->m_param->csvfpt, " -, -, -, -, -, -, -,"); + + fprintf(encoder->m_param->csvfpt, " %-6u, %-6u, %s\n", stats->maxCLL, stats->maxFALL, api->version_str); } }
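The CSV hunks differ in whether the `MaxCLL, MaxFALL` columns are conditional: on the `-` side both the header and the data row emit them only when `csvLogLevel >= 2 || maxCLL || maxFALL`, while the `+` side always writes them. Either way the header and the row have to agree on the column count. A minimal sketch of keeping the two in step with one shared predicate; `CsvOptions`, `headerTail`, `rowTail` and `countFields` are hypothetical names, not x265 API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical subset of the parameters that decide the optional columns.
struct CsvOptions { int csvLogLevel; int maxCLL; int maxFALL; };

// One predicate shared by header and row, so they cannot drift apart.
inline bool wantsLightLevels(const CsvOptions& o)
{
    return o.csvLogLevel >= 2 || o.maxCLL || o.maxFALL;
}

// Trailing part of the summary header.
inline std::string headerTail(const CsvOptions& o)
{
    std::string s;
    if (wantsLightLevels(o))
        s += "MaxCLL, MaxFALL,";
    s += " Version\n";
    return s;
}

// Trailing part of the summary row, built with the same condition.
inline std::string rowTail(const CsvOptions& o, unsigned cll, unsigned fall, const std::string& version)
{
    std::string s;
    if (wantsLightLevels(o))
        s += " " + std::to_string(cll) + ", " + std::to_string(fall) + ",";
    s += " " + version + "\n";
    return s;
}

// Number of comma-separated fields in a line.
inline std::size_t countFields(const std::string& line)
{
    std::size_t n = 1;
    for (char c : line)
        if (c == ',')
            ++n;
    return n;
}
```

With the predicate shared, `countFields(headerTail(o))` equals `countFields(rowTail(o, ...))` for any option combination.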
x265_2.7.tar.gz/source/encoder/dpb.cpp -> x265_2.6.tar.gz/source/encoder/dpb.cpp
Changed
@@ -92,14 +92,19 @@
             m_freeList.pushBack(*curFrame);
             curFrame->m_encData->m_freeListNext = m_frameDataFreeList;
             m_frameDataFreeList = curFrame->m_encData;
-            for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
+
+            if (curFrame->m_encData->m_meBuffer)
             {
-                if (curFrame->m_encData->m_meBuffer[i] != NULL)
+                for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
                 {
-                    X265_FREE(curFrame->m_encData->m_meBuffer[i]);
-                    curFrame->m_encData->m_meBuffer[i] = NULL;
+                    if (curFrame->m_encData->m_meBuffer[i] != NULL)
+                    {
+                        X265_FREE(curFrame->m_encData->m_meBuffer[i]);
+                        curFrame->m_encData->m_meBuffer[i] = NULL;
+                    }
                 }
             }
+
             if (curFrame->m_ctuInfo != NULL)
             {
                 uint32_t widthInCU = (curFrame->m_param->sourceWidth + curFrame->m_param->maxCUSize - 1) >> curFrame->m_param->maxLog2CUSize;
@@ -176,10 +181,7 @@
     // Mark pictures in m_piclist as unreferenced if they are not included in RPS
     applyReferencePictureSet(&slice->m_rps, pocCurr);
-    if (slice->m_sliceType != I_SLICE)
-        slice->m_numRefIdx[0] = x265_clip3(1, newFrame->m_param->maxNumReferences, slice->m_rps.numberOfNegativePictures);
-    else
-        slice->m_numRefIdx[0] = X265_MIN(newFrame->m_param->maxNumReferences, slice->m_rps.numberOfNegativePictures); // Ensuring L0 contains just the -ve POC
+    slice->m_numRefIdx[0] = X265_MIN(newFrame->m_param->maxNumReferences, slice->m_rps.numberOfNegativePictures); // Ensuring L0 contains just the -ve POC
     slice->m_numRefIdx[1] = X265_MIN(newFrame->m_param->bBPyramid ? 2 : 1, slice->m_rps.numberOfPositivePictures);
     slice->setRefPicList(m_picList);
@@ -228,14 +230,11 @@
     {
         if ((iterPic->m_poc != curPoc) && iterPic->m_encData->m_bHasReferences)
         {
-            if ((m_lastIDR >= curPoc) || (m_lastIDR <= iterPic->m_poc))
-            {
-                rps->poc[poci] = iterPic->m_poc;
-                rps->deltaPOC[poci] = rps->poc[poci] - curPoc;
-                (rps->deltaPOC[poci] < 0) ? numNeg++ : numPos++;
-                rps->bUsed[poci] = !isRAP;
-                poci++;
-            }
+            rps->poc[poci] = iterPic->m_poc;
+            rps->deltaPOC[poci] = rps->poc[poci] - curPoc;
+            (rps->deltaPOC[poci] < 0) ? numNeg++ : numPos++;
+            rps->bUsed[poci] = !isRAP;
+            poci++;
         }
         iterPic = iterPic->m_next;
     }
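The middle dpb.cpp hunk differs in how `m_numRefIdx[0]` is derived. The `-` lines clamp non-I slices into the range `[1, maxNumReferences]` with `x265_clip3`, while the `+` line takes a plain minimum, which yields zero when the RPS has no negative pictures. A small sketch of the difference; `clip3`, `numRefIdxL0Clipped` and `numRefIdxL0Min` are local illustrations under that reading, not the x265 helpers themselves.

```cpp
#include <algorithm>
#include <cassert>

// Local three-way clamp: force v into [lo, hi]. Written here for illustration;
// it is assumed to match the intent of x265_clip3 in the removed lines.
inline int clip3(int lo, int hi, int v)
{
    return std::min(std::max(v, lo), hi);
}

// "-" side: non-I slices always get at least one L0 reference.
inline int numRefIdxL0Clipped(bool isIntraSlice, int maxNumReferences, int numNegativePictures)
{
    if (!isIntraSlice)
        return clip3(1, maxNumReferences, numNegativePictures);
    return std::min(maxNumReferences, numNegativePictures);
}

// "+" side: plain minimum, which may be zero.
inline int numRefIdxL0Min(int maxNumReferences, int numNegativePictures)
{
    return std::min(maxNumReferences, numNegativePictures);
}
```

The two variants only disagree when a non-I slice has no negative-POC pictures in its RPS: the clamped form returns 1 and the minimum form returns 0.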
x265_2.7.tar.gz/source/encoder/encoder.cpp -> x265_2.6.tar.gz/source/encoder/encoder.cpp
Changed
@@ -50,8 +50,10 @@ /* Threshold for motion vection, based on expermental result. * TODO: come up an algorithm for adoptive threshold */ -#define MVTHRESHOLD (10*10) + +#define MVTHRESHOLD 10 #define PU_2Nx2N 1 + static const char* defaultAnalysisFileName = "x265_analysis.dat"; using namespace X265_NS; @@ -77,6 +79,7 @@ m_param = NULL; m_latestParam = NULL; m_threadPool = NULL; + m_analysisFile = NULL; m_analysisFileIn = NULL; m_analysisFileOut = NULL; m_offsetEmergency = NULL; @@ -341,29 +344,19 @@ m_aborted = true; if (!m_lookahead->create()) m_aborted = true; + initRefIdx(); - if (m_param->analysisSave && m_param->bUseAnalysisFile) - { - char* temp = strcatFilename(m_param->analysisSave, ".temp"); - if (!temp) - m_aborted = true; - else - { - m_analysisFileOut = x265_fopen(temp, "wb"); - X265_FREE(temp); - } - if (!m_analysisFileOut) - { - x265_log_file(NULL, X265_LOG_ERROR, "Analysis save: failed to open file %s.temp\n", m_param->analysisSave); - m_aborted = true; - } - } - if (m_param->analysisLoad && m_param->bUseAnalysisFile) + + if (m_param->analysisReuseMode) { - m_analysisFileIn = x265_fopen(m_param->analysisLoad, "rb"); - if (!m_analysisFileIn) + const char* name = m_param->analysisReuseFileName; + if (!name) + name = defaultAnalysisFileName; + const char* mode = m_param->analysisReuseMode == X265_ANALYSIS_LOAD ? 
"rb" : "wb"; + m_analysisFile = x265_fopen(name, mode); + if (!m_analysisFile) { - x265_log_file(NULL, X265_LOG_ERROR, "Analysis load: failed to open file %s\n", m_param->analysisLoad); + x265_log_file(NULL, X265_LOG_ERROR, "Analysis load/save: failed to open file %s\n", name); m_aborted = true; } } @@ -457,7 +450,7 @@ return 0; } -int Encoder::getRefFrameList(PicYuv** l0, PicYuv** l1, int sliceType, int poc, int* pocL0, int* pocL1) +int Encoder::getRefFrameList(PicYuv** l0, PicYuv** l1, int sliceType, int poc) { if (!(IS_X265_TYPE_I(sliceType))) { @@ -469,10 +462,9 @@ if (framePtr->m_encData->m_slice->m_refFrameList[0][j] && framePtr->m_encData->m_slice->m_refFrameList[0][j]->m_reconPic != NULL) { int l0POC = framePtr->m_encData->m_slice->m_refFrameList[0][j]->m_poc; - pocL0[j] = l0POC; Frame* l0Fp = m_dpb->m_picList.getPOC(l0POC); - while (l0Fp->m_reconRowFlag[l0Fp->m_numRows - 1].get() == 0) - l0Fp->m_reconRowFlag[l0Fp->m_numRows - 1].waitForChange(0); /* If recon is not ready, current frame encoder has to wait. */ + if (l0Fp->m_reconPic->m_picOrg[0] == NULL) + l0Fp->m_reconEncoded.wait(); /* If recon is not ready, current frame encoder need to wait. */ l0[j] = l0Fp->m_reconPic; } } @@ -481,19 +473,15 @@ if (framePtr->m_encData->m_slice->m_refFrameList[1][j] && framePtr->m_encData->m_slice->m_refFrameList[1][j]->m_reconPic != NULL) { int l1POC = framePtr->m_encData->m_slice->m_refFrameList[1][j]->m_poc; - pocL1[j] = l1POC; Frame* l1Fp = m_dpb->m_picList.getPOC(l1POC); - while (l1Fp->m_reconRowFlag[l1Fp->m_numRows - 1].get() == 0) - l1Fp->m_reconRowFlag[l1Fp->m_numRows - 1].waitForChange(0); /* If recon is not ready, current frame encoder has to wait. */ + if (l1Fp->m_reconPic->m_picOrg[0] == NULL) + l1Fp->m_reconEncoded.wait(); /* If recon is not ready, current frame encoder need to wait. 
*/ l1[j] = l1Fp->m_reconPic; } } } else - { - x265_log(NULL, X265_LOG_WARNING, "Current frame is not in DPB piclist.\n"); - return 1; - } + x265_log(NULL, X265_LOG_WARNING, "Refrence List is not in piclist\n"); } else { @@ -576,19 +564,19 @@ { int cuOffset = cuI * bytes + pu; (interData)->mergeFlag[cuPos + cuOffset] = (srcInterData)->mergeFlag[(mbIndex * 16) + cuOffset]; - (interData)->sadCost[cuPos + cuOffset] = (srcInterData)->sadCost[(mbIndex * 16) + cuOffset]; + (interData)->interDir[cuPos + cuOffset] = (srcInterData)->interDir[(mbIndex * 16) + cuOffset]; for (uint32_t k = 0; k < numDir; k++) { (interData)->mvpIdx[k][cuPos + cuOffset] = (srcInterData)->mvpIdx[k][(mbIndex * 16) + cuOffset]; (interData)->refIdx[k][cuPos + cuOffset] = (srcInterData)->refIdx[k][(mbIndex * 16) + cuOffset]; memcpy(&(interData)->mv[k][cuPos + cuOffset], &(srcInterData)->mv[k][(mbIndex * 16) + cuOffset], sizeof(MV)); - if (m_param->analysisReuseLevel == 7 && numPU == PU_2Nx2N && - ((interData)->depth[cuPos + cuOffset] == (m_param->maxCUSize >> 5))) + if (m_param->analysisReuseLevel == 7) { - int mv_x = (interData)->mv[k][cuPos + cuOffset].x; - int mv_y = (interData)->mv[k][cuPos + cuOffset].y; - if ((mv_x*mv_x + mv_y*mv_y) <= MVTHRESHOLD) + int mv_x = ((analysis_inter_data *)curFrame->m_analysisData.interData)->mv[k][(mbIndex * 16) + cuOffset].x; + int mv_y = ((analysis_inter_data *)curFrame->m_analysisData.interData)->mv[k][(mbIndex * 16) + cuOffset].y; + double mv = sqrt(mv_x*mv_x + mv_y*mv_y); + if (numPU == PU_2Nx2N && ((srcInterData)->depth[cuPos + cuOffset] == (m_param->maxCUSize >> 5)) && mv <= MVTHRESHOLD) memset(&curFrame->m_analysisData.modeFlag[k][cuPos + cuOffset], 1, bytes); } } @@ -654,10 +642,9 @@ if (m_param->analysisReuseLevel > 4) { memset(&(currInterData)->partSize[count], (interData)->partSize[d], bytes); - int numPU = nbPartsTable[(interData)->partSize[d]]; - for (int pu = 0; pu < numPU; pu++) + int numPU = nbPartsTable[(currInterData)->partSize[d]]; + for (int pu 
= 0; pu < numPU; pu++, d++) { - if (pu) d++; (currInterData)->mergeFlag[count + pu] = (interData)->mergeFlag[d]; if (m_param->analysisReuseLevel >= 7) { @@ -667,11 +654,12 @@ (currInterData)->mvpIdx[i][count + pu] = (interData)->mvpIdx[i][d]; (currInterData)->refIdx[i][count + pu] = (interData)->refIdx[i][d]; memcpy(&(currInterData)->mv[i][count + pu], &(interData)->mv[i][d], sizeof(MV)); - if (m_param->analysisReuseLevel == 7 && numPU == PU_2Nx2N && m_param->num4x4Partitions <= 16) + if (m_param->analysisReuseLevel == 7) { - int mv_x = (currInterData)->mv[i][count + pu].x; - int mv_y = (currInterData)->mv[i][count + pu].y; - if ((mv_x*mv_x + mv_y*mv_y) <= MVTHRESHOLD) + int mv_x = ((analysis_inter_data *)curFrame->m_analysisData.interData)->mv[i][count + pu].x; + int mv_y = ((analysis_inter_data *)curFrame->m_analysisData.interData)->mv[i][count + pu].y; + double mv = sqrt(mv_x*mv_x + mv_y*mv_y); + if (numPU == PU_2Nx2N && m_param->num4x4Partitions <= 16 && mv <= MVTHRESHOLD) memset(&curFrame->m_analysisData.modeFlag[i][count + pu], 1, bytes); } } @@ -732,6 +720,9 @@ X265_FREE(m_offsetEmergency); + if (m_analysisFile) + fclose(m_analysisFile); + if (m_latestParam != NULL && m_latestParam != m_param) { if (m_latestParam->scalingLists != m_param->scalingLists) @@ -746,7 +737,7 @@ { int bError = 1; fclose(m_analysisFileOut); - const char* name = m_param->analysisSave ? 
m_param->analysisSave : m_param->analysisReuseFileName; + const char* name = m_param->analysisReuseFileName; if (!name) name = defaultAnalysisFileName; char* temp = strcatFilename(name, ".temp"); @@ -774,8 +765,6 @@ free((char*)m_param->numaPools); free((char*)m_param->masteringDisplayColorVolume); free((char*)m_param->toneMapFile); - free((char*)m_param->analysisSave); - free((char*)m_param->analysisLoad); PARAM_NS::x265_param_free(m_param); } } @@ -862,7 +851,7 @@ if (m_exportedPic) { - if (!m_param->bUseAnalysisFile && m_param->analysisSave) + if (!m_param->bUseAnalysisFile && m_param->analysisReuseMode == X265_ANALYSIS_SAVE) freeAnalysis(&m_exportedPic->m_analysisData); ATOMIC_DEC(&m_exportedPic->m_countRefEncoders); m_exportedPic = NULL; @@ -1047,7 +1036,7 @@ /* In analysisSave mode, x265_analysis_data is allocated in pic_in and inFrame points to this */ /* Load analysis data before lookahead->addPicture, since sliceType has been decided */ - if (m_param->analysisLoad) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD) { /* readAnalysisFile reads analysis data for the frame and allocates memory based on slicetype */ readAnalysisFile(&inFrame->m_analysisData, inFrame->m_poc, pic_in); @@ -1060,14 +1049,11 @@ inFrame->m_lowres.sliceType = sliceType; inFrame->m_lowres.bKeyframe = !!inFrame->m_analysisData.lookahead.keyframe; inFrame->m_lowres.bLastMiniGopBFrame = !!inFrame->m_analysisData.lookahead.lastMiniGopBFrame; - if (m_rateControl->m_isVbv) + int vbvCount = m_param->lookaheadDepth + m_param->bframes + 2; + for (int index = 0; index < vbvCount; index++) { - int vbvCount = m_param->lookaheadDepth + m_param->bframes + 2; - for (int index = 0; index < vbvCount; index++) - { - inFrame->m_lowres.plannedSatd[index] = inFrame->m_analysisData.lookahead.plannedSatd[index]; - inFrame->m_lowres.plannedType[index] = inFrame->m_analysisData.lookahead.plannedType[index]; - } + inFrame->m_lowres.plannedSatd[index] = inFrame->m_analysisData.lookahead.plannedSatd[index]; 
+ inFrame->m_lowres.plannedType[index] = inFrame->m_analysisData.lookahead.plannedType[index]; } } } @@ -1132,7 +1118,7 @@ x265_frame_stats* frameData = NULL; /* Free up pic_in->analysisData since it has already been used */ - if ((m_param->analysisLoad && !m_param->analysisSave) || (m_param->bMVType && slice->m_sliceType != I_SLICE)) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD || (m_param->bMVType && slice->m_sliceType != I_SLICE)) freeAnalysis(&outFrame->m_analysisData); if (pic_out) @@ -1158,7 +1144,7 @@ } /* Dump analysis data from pic_out to file in save mode and free */ - if (m_param->analysisSave) + if (m_param->analysisReuseMode == X265_ANALYSIS_SAVE) { pic_out->analysisData.poc = pic_out->poc; pic_out->analysisData.sliceType = pic_out->sliceType; @@ -1181,29 +1167,26 @@ pic_out->analysisData.satdCost *= factor; pic_out->analysisData.lookahead.keyframe = outFrame->m_lowres.bKeyframe; pic_out->analysisData.lookahead.lastMiniGopBFrame = outFrame->m_lowres.bLastMiniGopBFrame; - if (m_rateControl->m_isVbv) + int vbvCount = m_param->lookaheadDepth + m_param->bframes + 2; + for (int index = 0; index < vbvCount; index++) { - int vbvCount = m_param->lookaheadDepth + m_param->bframes + 2; - for (int index = 0; index < vbvCount; index++) - { - pic_out->analysisData.lookahead.plannedSatd[index] = outFrame->m_lowres.plannedSatd[index] * factor; - pic_out->analysisData.lookahead.plannedType[index] = outFrame->m_lowres.plannedType[index]; - } - for (uint32_t index = 0; index < pic_out->analysisData.numCuInHeight; index++) - { - outFrame->m_analysisData.lookahead.intraSatdForVbv[index] = outFrame->m_encData->m_rowStat[index].intraSatdForVbv * factor; - outFrame->m_analysisData.lookahead.satdForVbv[index] = outFrame->m_encData->m_rowStat[index].satdForVbv * factor; - } - pic_out->analysisData.lookahead.intraSatdForVbv = outFrame->m_analysisData.lookahead.intraSatdForVbv; - pic_out->analysisData.lookahead.satdForVbv = 
outFrame->m_analysisData.lookahead.satdForVbv; - for (uint32_t index = 0; index < pic_out->analysisData.numCUsInFrame; index++) - { - outFrame->m_analysisData.lookahead.intraVbvCost[index] = outFrame->m_encData->m_cuStat[index].intraVbvCost * factor; - outFrame->m_analysisData.lookahead.vbvCost[index] = outFrame->m_encData->m_cuStat[index].vbvCost * factor; - } - pic_out->analysisData.lookahead.intraVbvCost = outFrame->m_analysisData.lookahead.intraVbvCost; - pic_out->analysisData.lookahead.vbvCost = outFrame->m_analysisData.lookahead.vbvCost; + pic_out->analysisData.lookahead.plannedSatd[index] = outFrame->m_lowres.plannedSatd[index] * factor; + pic_out->analysisData.lookahead.plannedType[index] = outFrame->m_lowres.plannedType[index]; + } + for (uint32_t index = 0; index < pic_out->analysisData.numCuInHeight; index++) + { + outFrame->m_analysisData.lookahead.intraSatdForVbv[index] = outFrame->m_encData->m_rowStat[index].intraSatdForVbv * factor; + outFrame->m_analysisData.lookahead.satdForVbv[index] = outFrame->m_encData->m_rowStat[index].satdForVbv * factor; + } + pic_out->analysisData.lookahead.intraSatdForVbv = outFrame->m_analysisData.lookahead.intraSatdForVbv; + pic_out->analysisData.lookahead.satdForVbv = outFrame->m_analysisData.lookahead.satdForVbv; + for (uint32_t index = 0; index < pic_out->analysisData.numCUsInFrame; index++) + { + outFrame->m_analysisData.lookahead.intraVbvCost[index] = outFrame->m_encData->m_cuStat[index].intraVbvCost * factor; + outFrame->m_analysisData.lookahead.vbvCost[index] = outFrame->m_encData->m_cuStat[index].vbvCost * factor; } + pic_out->analysisData.lookahead.intraVbvCost = outFrame->m_analysisData.lookahead.intraVbvCost; + pic_out->analysisData.lookahead.vbvCost = outFrame->m_analysisData.lookahead.vbvCost; } writeAnalysisFile(&pic_out->analysisData, *outFrame->m_encData); if (m_param->bUseAnalysisFile) @@ -1367,21 +1350,18 @@ slice->m_maxNumMergeCand = m_param->maxNumMergeCand; slice->m_endCUAddr = 
slice->realEndAddress(m_sps.numCUsInFrame * m_param->num4x4Partitions); } - if (m_param->analysisLoad && m_param->bDisableLookahead) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->bDisableLookahead) { frameEnc->m_dts = frameEnc->m_analysisData.lookahead.dts; - if (m_rateControl->m_isVbv) + for (uint32_t index = 0; index < frameEnc->m_analysisData.numCuInHeight; index++) { - for (uint32_t index = 0; index < frameEnc->m_analysisData.numCuInHeight; index++) - { - frameEnc->m_encData->m_rowStat[index].intraSatdForVbv = frameEnc->m_analysisData.lookahead.intraSatdForVbv[index]; - frameEnc->m_encData->m_rowStat[index].satdForVbv = frameEnc->m_analysisData.lookahead.satdForVbv[index]; - } - for (uint32_t index = 0; index < frameEnc->m_analysisData.numCUsInFrame; index++) - { - frameEnc->m_encData->m_cuStat[index].intraVbvCost = frameEnc->m_analysisData.lookahead.intraVbvCost[index]; - frameEnc->m_encData->m_cuStat[index].vbvCost = frameEnc->m_analysisData.lookahead.vbvCost[index]; - } + frameEnc->m_encData->m_rowStat[index].intraSatdForVbv = frameEnc->m_analysisData.lookahead.intraSatdForVbv[index]; + frameEnc->m_encData->m_rowStat[index].satdForVbv = frameEnc->m_analysisData.lookahead.satdForVbv[index]; + } + for (uint32_t index = 0; index < frameEnc->m_analysisData.numCUsInFrame; index++) + { + frameEnc->m_encData->m_cuStat[index].intraVbvCost = frameEnc->m_analysisData.lookahead.intraVbvCost[index]; + frameEnc->m_encData->m_cuStat[index].vbvCost = frameEnc->m_analysisData.lookahead.vbvCost[index]; } } if (m_param->searchMethod == X265_SEA && frameEnc->m_lowres.sliceType != X265_TYPE_B) @@ -1436,7 +1416,7 @@ frameEnc->m_encData->m_slice->m_iNumRPSInSPS = m_sps.spsrpsNum; curEncoder->m_rce.encodeOrder = frameEnc->m_encodeOrder = m_encodedFrameNum++; - if (!m_param->analysisLoad || !m_param->bDisableLookahead) + if (m_param->analysisReuseMode != X265_ANALYSIS_LOAD || !m_param->bDisableLookahead) { if (m_bframeDelay) { @@ -1451,7 +1431,7 @@ } /* 
Allocate analysis data before encode in save mode. This is allocated in frameEnc */ - if (m_param->analysisSave && !m_param->analysisLoad) + if (m_param->analysisReuseMode == X265_ANALYSIS_SAVE) { x265_analysis_data* analysis = &frameEnc->m_analysisData; analysis->poc = frameEnc->m_poc; @@ -1983,12 +1963,11 @@ stats->statsB.psnrU = m_analyzeB.m_psnrSumU / (double)m_analyzeB.m_numPics; stats->statsB.psnrV = m_analyzeB.m_psnrSumV / (double)m_analyzeB.m_numPics; stats->statsB.ssim = x265_ssim2dB(m_analyzeB.m_globalSsim / (double)m_analyzeB.m_numPics); - if (m_param->csvLogLevel >= 2 || m_param->maxCLL || m_param->maxFALL) - { - stats->maxCLL = m_analyzeAll.m_maxCLL; - stats->maxFALL = (uint16_t)(m_analyzeAll.m_maxFALL / m_analyzeAll.m_numPics); - } + + stats->maxCLL = m_analyzeAll.m_maxCLL; + stats->maxFALL = (uint16_t)(m_analyzeAll.m_maxFALL / m_analyzeAll.m_numPics); } + /* If new statistics are added to x265_stats, we must check here whether the * structure provided by the user is the new structure or an older one (for * future safety) */ @@ -2060,11 +2039,10 @@ if (m_param->bEnableSsim) m_analyzeB.addSsim(ssim); } - if (m_param->csvLogLevel >= 2 || m_param->maxCLL || m_param->maxFALL) - { - m_analyzeAll.m_maxFALL += curFrame->m_fencPic->m_avgLumaLevel; - m_analyzeAll.m_maxCLL = X265_MAX(m_analyzeAll.m_maxCLL, curFrame->m_fencPic->m_maxLumaLevel); - } + + m_analyzeAll.m_maxFALL += curFrame->m_fencPic->m_avgLumaLevel; + m_analyzeAll.m_maxCLL = X265_MAX(m_analyzeAll.m_maxCLL, curFrame->m_fencPic->m_maxLumaLevel); + char c = (slice->isIntra() ? (curFrame->m_lowres.sliceType == X265_TYPE_IDR ? 'I' : 'i') : slice->isInterP() ? 'P' : 'B'); int poc = slice->m_poc; if (!IS_REFERENCED(curFrame)) @@ -2103,7 +2081,13 @@ frameStats->list1POC[ref] = ref < slice->m_numRefIdx[1] ? 
slice->m_refPOCList[1][ref] - slice->m_lastIDR : -1; } } + #define ELAPSED_MSEC(start, end) (((double)(end) - (start)) / 1000) + + frameStats->maxLumaLevel = curFrame->m_fencPic->m_maxLumaLevel; + frameStats->minLumaLevel = curFrame->m_fencPic->m_minLumaLevel; + frameStats->avgLumaLevel = curFrame->m_fencPic->m_avgLumaLevel; + if (m_param->csvLogLevel >= 2) { frameStats->decideWaitTime = ELAPSED_MSEC(0, curEncoder->m_slicetypeWaitTime); @@ -2123,9 +2107,6 @@ frameStats->avgLumaDistortion = curFrame->m_encData->m_frameStats.avgLumaDistortion; frameStats->avgPsyEnergy = curFrame->m_encData->m_frameStats.avgPsyEnergy; frameStats->avgResEnergy = curFrame->m_encData->m_frameStats.avgResEnergy; - frameStats->maxLumaLevel = curFrame->m_fencPic->m_maxLumaLevel; - frameStats->minLumaLevel = curFrame->m_fencPic->m_minLumaLevel; - frameStats->avgLumaLevel = curFrame->m_fencPic->m_avgLumaLevel; frameStats->maxChromaULevel = curFrame->m_fencPic->m_maxChromaULevel; frameStats->minChromaULevel = curFrame->m_fencPic->m_minChromaULevel; @@ -2304,7 +2285,7 @@ if (buffer) { sprintf(buffer, "x265 (build %d) - %s:%s - H.265/HEVC codec - " - "Copyright 2013-2018 (c) Multicoreware, Inc - " + "Copyright 2013-2017 (c) Multicoreware, Inc - " "http://x265.org - options: %s", X265_BUILD, PFX(version_str), PFX(build_info_str), opts); @@ -2470,18 +2451,6 @@ this->m_externalFlush = true; else this->m_externalFlush = false; - - if (p->bMVType == AVC_INFO && (p->limitTU == 3 || p->limitTU == 4)) - { - x265_log(p, X265_LOG_WARNING, "limit TU = 3 or 4 with MVType AVCINFO produces inconsistent output\n"); - } - - if (p->bMVType == AVC_INFO && p->minCUSize != 8) - { - p->minCUSize = 8; - x265_log(p, X265_LOG_WARNING, "Setting minCuSize = 8, AVCINFO expects 8x8 blocks\n"); - } - if (p->keyframeMax < 0) { /* A negative max GOP size indicates the user wants only one I frame at @@ -2635,24 +2604,23 @@ p->rc.rfConstantMin = 0; } - if ((p->analysisLoad || p->analysisSave) && (p->bDistributeModeAnalysis || 
p->bDistributeMotionEstimation)) + if (p->analysisReuseMode && (p->bDistributeModeAnalysis || p->bDistributeMotionEstimation)) { x265_log(p, X265_LOG_WARNING, "Analysis load/save options incompatible with pmode/pme, Disabling pmode/pme\n"); p->bDistributeMotionEstimation = p->bDistributeModeAnalysis = 0; } - if ((p->analysisLoad || p->analysisSave) && p->rc.cuTree) + if (p->analysisReuseMode && p->rc.cuTree) { x265_log(p, X265_LOG_WARNING, "Analysis load/save options works only with cu-tree off, Disabling cu-tree\n"); p->rc.cuTree = 0; } - if ((p->analysisLoad || p->analysisSave) && (p->analysisMultiPassRefine || p->analysisMultiPassDistortion)) + if (p->analysisReuseMode && (p->analysisMultiPassRefine || p->analysisMultiPassDistortion)) { x265_log(p, X265_LOG_WARNING, "Cannot use Analysis load/save option and multi-pass-opt-analysis/multi-pass-opt-distortion together," "Disabling Analysis load/save and multi-pass-opt-analysis/multi-pass-opt-distortion\n"); - p->analysisSave = p->analysisLoad = NULL; - p->analysisMultiPassRefine = p->analysisMultiPassDistortion = 0; + p->analysisReuseMode = p->analysisMultiPassRefine = p->analysisMultiPassDistortion = 0; } if (p->scaleFactor) { @@ -2660,16 +2628,16 @@ { p->scaleFactor = 0; } - else if ((!p->analysisLoad && !p->analysisSave) || p->analysisReuseLevel < 10) + else if (!p->analysisReuseMode || p->analysisReuseLevel < 10) { - x265_log(p, X265_LOG_WARNING, "Input scaling works with analysis load/save, analysis-reuse-level 10. Disabling scale-factor.\n"); + x265_log(p, X265_LOG_WARNING, "Input scaling works with analysis-reuse-mode, analysis-reuse-level 10. Disabling scale-factor.\n"); p->scaleFactor = 0; } } if (p->intraRefine) { - if (!p->analysisLoad || p->analysisReuseLevel < 10 || !p->scaleFactor) + if (p->analysisReuseMode!= X265_ANALYSIS_LOAD || p->analysisReuseLevel < 10 || !p->scaleFactor) { x265_log(p, X265_LOG_WARNING, "Intra refinement requires analysis load, analysis-reuse-level 10, scale factor. 
Disabling intra refine.\n"); p->intraRefine = 0; @@ -2678,7 +2646,7 @@ if (p->interRefine) { - if (!p->analysisLoad || p->analysisReuseLevel < 10 || !p->scaleFactor) + if (p->analysisReuseMode != X265_ANALYSIS_LOAD || p->analysisReuseLevel < 10 || !p->scaleFactor) { x265_log(p, X265_LOG_WARNING, "Inter refinement requires analysis load, analysis-reuse-level 10, scale factor. Disabling inter refine.\n"); p->interRefine = 0; @@ -2693,7 +2661,7 @@ if (p->mvRefine) { - if (!p->analysisLoad || p->analysisReuseLevel < 10 || !p->scaleFactor) + if (p->analysisReuseMode != X265_ANALYSIS_LOAD || p->analysisReuseLevel < 10 || !p->scaleFactor) { x265_log(p, X265_LOG_WARNING, "MV refinement requires analysis load, analysis-reuse-level 10, scale factor. Disabling MV refine.\n"); p->mvRefine = 0; @@ -2794,7 +2762,7 @@ m_conformanceWindow.bottomOffset = 0; m_conformanceWindow.leftOffset = 0; /* set pad size if width is not multiple of the minimum CU size */ - if (p->scaleFactor == 2 && ((p->sourceWidth / 2) & (p->minCUSize - 1)) && p->analysisLoad) + if (p->scaleFactor == 2 && ((p->sourceWidth / 2) & (p->minCUSize - 1)) && p->analysisReuseMode == X265_ANALYSIS_LOAD) { uint32_t rem = (p->sourceWidth / 2) & (p->minCUSize - 1); uint32_t padsize = p->minCUSize - rem; @@ -2983,7 +2951,7 @@ } } /* set pad size if height is not multiple of the minimum CU size */ - if (p->scaleFactor == 2 && ((p->sourceHeight / 2) & (p->minCUSize - 1)) && p->analysisLoad) + if (p->scaleFactor == 2 && ((p->sourceHeight / 2) & (p->minCUSize - 1)) && p->analysisReuseMode == X265_ANALYSIS_LOAD) { uint32_t rem = (p->sourceHeight / 2) & (p->minCUSize - 1); uint32_t padsize = p->minCUSize - rem; @@ -3048,19 +3016,13 @@ p->maxCUDepth = p->maxLog2CUSize - g_log2Size[p->minCUSize]; p->unitSizeDepth = p->maxLog2CUSize - LOG2_UNIT_SIZE; p->num4x4Partitions = (1U << (p->unitSizeDepth << 1)); - - if (p->radl && (p->keyframeMax != p->keyframeMin)) - { - p->radl = 0; - x265_log(p, X265_LOG_WARNING, "Radl requires fixed 
gop-length (keyint == min-keyint). Disabling radl.\n"); - } } void Encoder::allocAnalysis(x265_analysis_data* analysis) { X265_CHECK(analysis->sliceType, "invalid slice type\n"); analysis->interData = analysis->intraData = NULL; - if (m_param->bDisableLookahead && m_rateControl->m_isVbv) + if (m_param->bDisableLookahead) { CHECKED_MALLOC_ZERO(analysis->lookahead.intraSatdForVbv, uint32_t, analysis->numCuInHeight); CHECKED_MALLOC_ZERO(analysis->lookahead.satdForVbv, uint32_t, analysis->numCuInHeight); @@ -3102,14 +3064,14 @@ if (m_param->analysisReuseLevel >= 7) { CHECKED_MALLOC(interData->interDir, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(interData->sadCost, int64_t, analysis->numPartitions * analysis->numCUsInFrame); for (int dir = 0; dir < numDir; dir++) { CHECKED_MALLOC(interData->mvpIdx[dir], uint8_t, analysis->numPartitions * analysis->numCUsInFrame); CHECKED_MALLOC(interData->refIdx[dir], int8_t, analysis->numPartitions * analysis->numCUsInFrame); CHECKED_MALLOC(interData->mv[dir], MV, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC_ZERO(analysis->modeFlag[dir], uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC(analysis->modeFlag[dir], uint8_t, analysis->numPartitions * analysis->numCUsInFrame); } + /* Allocate intra in inter */ if (analysis->sliceType == X265_TYPE_P || m_param->bIntraInBFrames) { @@ -3131,9 +3093,10 @@ freeAnalysis(analysis); m_aborted = true; } + void Encoder::freeAnalysis(x265_analysis_data* analysis) { - if (m_param->bDisableLookahead && m_rateControl->m_isVbv) + if (m_param->bDisableLookahead) { X265_FREE(analysis->lookahead.satdForVbv); X265_FREE(analysis->lookahead.intraSatdForVbv); @@ -3286,31 +3249,31 @@ static uint64_t consumedBytes = 0; static uint64_t totalConsumedBytes = 0; uint32_t depthBytes = 0; - if (m_param->bUseAnalysisFile) - fseeko(m_analysisFileIn, totalConsumedBytes, SEEK_SET); + fseeko(m_analysisFile, totalConsumedBytes, SEEK_SET); + 
const x265_analysis_data *picData = &(picIn->analysisData); analysis_intra_data *intraPic = (analysis_intra_data *)picData->intraData; analysis_inter_data *interPic = (analysis_inter_data *)picData->interData; int poc; uint32_t frameRecordSize; - X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->frameRecordSize)); - X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->depthBytes)); - X265_FREAD(&poc, sizeof(int), 1, m_analysisFileIn, &(picData->poc)); + X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFile, &(picData->frameRecordSize)); + X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFile, &(picData->depthBytes)); + X265_FREAD(&poc, sizeof(int), 1, m_analysisFile, &(picData->poc)); if (m_param->bUseAnalysisFile) { uint64_t currentOffset = totalConsumedBytes; /* Seeking to the right frame Record */ - while (poc != curPoc && !feof(m_analysisFileIn)) + while (poc != curPoc && !feof(m_analysisFile)) { currentOffset += frameRecordSize; - fseeko(m_analysisFileIn, currentOffset, SEEK_SET); - X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->frameRecordSize)); - X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->depthBytes)); - X265_FREAD(&poc, sizeof(int), 1, m_analysisFileIn, &(picData->poc)); + fseeko(m_analysisFile, currentOffset, SEEK_SET); + X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFile, &(picData->frameRecordSize)); + X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFile, &(picData->depthBytes)); + X265_FREAD(&poc, sizeof(int), 1, m_analysisFile, &(picData->poc)); } - if (poc != curPoc || feof(m_analysisFileIn)) + if (poc != curPoc || feof(m_analysisFile)) { x265_log(NULL, X265_LOG_WARNING, "Error reading analysis data: Cannot find POC %d\n", curPoc); freeAnalysis(analysis); @@ -3321,29 +3284,30 @@ /* Now arrived at the right frame, read the record */ analysis->poc = poc; analysis->frameRecordSize = frameRecordSize; - 
X265_FREAD(&analysis->sliceType, sizeof(int), 1, m_analysisFileIn, &(picData->sliceType)); - X265_FREAD(&analysis->bScenecut, sizeof(int), 1, m_analysisFileIn, &(picData->bScenecut)); - X265_FREAD(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileIn, &(picData->satdCost)); - X265_FREAD(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFileIn, &(picData->numCUsInFrame)); - X265_FREAD(&analysis->numPartitions, sizeof(int), 1, m_analysisFileIn, &(picData->numPartitions)); + X265_FREAD(&analysis->sliceType, sizeof(int), 1, m_analysisFile, &(picData->sliceType)); + X265_FREAD(&analysis->bScenecut, sizeof(int), 1, m_analysisFile, &(picData->bScenecut)); + X265_FREAD(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFile, &(picData->satdCost)); + X265_FREAD(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFile, &(picData->numCUsInFrame)); + X265_FREAD(&analysis->numPartitions, sizeof(int), 1, m_analysisFile, &(picData->numPartitions)); if (m_param->bDisableLookahead) { - X265_FREAD(&analysis->numCuInHeight, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->numCuInHeight)); - X265_FREAD(&analysis->lookahead, sizeof(x265_lookahead_data), 1, m_analysisFileIn, &(picData->lookahead)); + X265_FREAD(&analysis->numCuInHeight, sizeof(uint32_t), 1, m_analysisFile, &(picData->numCuInHeight)); + X265_FREAD(&analysis->lookahead, sizeof(x265_lookahead_data), 1, m_analysisFile, &(picData->lookahead)); } int scaledNumPartition = analysis->numPartitions; int factor = 1 << m_param->scaleFactor; if (m_param->scaleFactor) analysis->numPartitions *= factor; + /* Memory is allocated for inter and intra analysis data based on the slicetype */ allocAnalysis(analysis); - if (m_param->bDisableLookahead && m_rateControl->m_isVbv) + if (m_param->bDisableLookahead) { - X265_FREAD(analysis->lookahead.intraVbvCost, sizeof(uint32_t), analysis->numCUsInFrame, m_analysisFileIn, picData->lookahead.intraVbvCost); - X265_FREAD(analysis->lookahead.vbvCost, sizeof(uint32_t), analysis->numCUsInFrame, 
m_analysisFileIn, picData->lookahead.vbvCost); - X265_FREAD(analysis->lookahead.satdForVbv, sizeof(uint32_t), analysis->numCuInHeight, m_analysisFileIn, picData->lookahead.satdForVbv); - X265_FREAD(analysis->lookahead.intraSatdForVbv, sizeof(uint32_t), analysis->numCuInHeight, m_analysisFileIn, picData->lookahead.intraSatdForVbv); + X265_FREAD(analysis->lookahead.intraVbvCost, sizeof(uint32_t), analysis->numCUsInFrame, m_analysisFile, picData->lookahead.intraVbvCost); + X265_FREAD(analysis->lookahead.vbvCost, sizeof(uint32_t), analysis->numCUsInFrame, m_analysisFile, picData->lookahead.vbvCost); + X265_FREAD(analysis->lookahead.satdForVbv, sizeof(uint32_t), analysis->numCuInHeight, m_analysisFile, picData->lookahead.satdForVbv); + X265_FREAD(analysis->lookahead.intraSatdForVbv, sizeof(uint32_t), analysis->numCuInHeight, m_analysisFile, picData->lookahead.intraSatdForVbv); } if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) { @@ -3357,9 +3321,9 @@ modeBuf = tempBuf + depthBytes; partSizes = tempBuf + 2 * depthBytes; - X265_FREAD(depthBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn, intraPic->depth); - X265_FREAD(modeBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn, intraPic->chromaModes); - X265_FREAD(partSizes, sizeof(uint8_t), depthBytes, m_analysisFileIn, intraPic->partSizes); + X265_FREAD(depthBuf, sizeof(uint8_t), depthBytes, m_analysisFile, intraPic->depth); + X265_FREAD(modeBuf, sizeof(uint8_t), depthBytes, m_analysisFile, intraPic->chromaModes); + X265_FREAD(partSizes, sizeof(uint8_t), depthBytes, m_analysisFile, intraPic->partSizes); size_t count = 0; for (uint32_t d = 0; d < depthBytes; d++) @@ -3380,12 +3344,12 @@ if (!m_param->scaleFactor) { - X265_FREAD(((analysis_intra_data *)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileIn, intraPic->modes); + X265_FREAD(((analysis_intra_data *)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * 
analysis->numPartitions, m_analysisFile, intraPic->modes); } else { uint8_t *tempLumaBuf = X265_MALLOC(uint8_t, analysis->numCUsInFrame * scaledNumPartition); - X265_FREAD(tempLumaBuf, sizeof(uint8_t), analysis->numCUsInFrame * scaledNumPartition, m_analysisFileIn, intraPic->modes); + X265_FREAD(tempLumaBuf, sizeof(uint8_t), analysis->numCUsInFrame * scaledNumPartition, m_analysisFile, intraPic->modes); for (uint32_t ctu32Idx = 0, cnt = 0; ctu32Idx < analysis->numCUsInFrame * scaledNumPartition; ctu32Idx++, cnt += factor) memset(&((analysis_intra_data *)analysis->intraData)->modes[cnt], tempLumaBuf[ctu32Idx], factor); X265_FREE(tempLumaBuf); @@ -3398,7 +3362,7 @@ { uint32_t numDir = analysis->sliceType == X265_TYPE_P ? 1 : 2; uint32_t numPlanes = m_param->internalCsp == X265_CSP_I400 ? 1 : 3; - X265_FREAD((WeightParam*)analysis->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFileIn, (picIn->analysisData.wt)); + X265_FREAD((WeightParam*)analysis->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFile, (picIn->analysisData.wt)); if (m_param->analysisReuseLevel < 2) return; @@ -3420,33 +3384,33 @@ depthBuf = tempBuf; modeBuf = tempBuf + depthBytes; - X265_FREAD(depthBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->depth); - X265_FREAD(modeBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->modes); + X265_FREAD(depthBuf, sizeof(uint8_t), depthBytes, m_analysisFile, interPic->depth); + X265_FREAD(modeBuf, sizeof(uint8_t), depthBytes, m_analysisFile, interPic->modes); if (m_param->analysisReuseLevel > 4) { partSize = modeBuf + depthBytes; mergeFlag = partSize + depthBytes; - X265_FREAD(partSize, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->partSize); - X265_FREAD(mergeFlag, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->mergeFlag); + X265_FREAD(partSize, sizeof(uint8_t), depthBytes, m_analysisFile, interPic->partSize); + X265_FREAD(mergeFlag, sizeof(uint8_t), depthBytes, m_analysisFile, interPic->mergeFlag); if 
(m_param->analysisReuseLevel == 10) { interDir = mergeFlag + depthBytes; - X265_FREAD(interDir, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->interDir); + X265_FREAD(interDir, sizeof(uint8_t), depthBytes, m_analysisFile, interPic->interDir); if (bIntraInInter) { chromaDir = interDir + depthBytes; - X265_FREAD(chromaDir, sizeof(uint8_t), depthBytes, m_analysisFileIn, intraPic->chromaModes); + X265_FREAD(chromaDir, sizeof(uint8_t), depthBytes, m_analysisFile, intraPic->chromaModes); } for (uint32_t i = 0; i < numDir; i++) { mvpIdx[i] = X265_MALLOC(uint8_t, depthBytes); refIdx[i] = X265_MALLOC(int8_t, depthBytes); mv[i] = X265_MALLOC(MV, depthBytes); - X265_FREAD(mvpIdx[i], sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->mvpIdx[i]); - X265_FREAD(refIdx[i], sizeof(int8_t), depthBytes, m_analysisFileIn, interPic->refIdx[i]); - X265_FREAD(mv[i], sizeof(MV), depthBytes, m_analysisFileIn, interPic->mv[i]); + X265_FREAD(mvpIdx[i], sizeof(uint8_t), depthBytes, m_analysisFile, interPic->mvpIdx[i]); + X265_FREAD(refIdx[i], sizeof(int8_t), depthBytes, m_analysisFile, interPic->refIdx[i]); + X265_FREAD(mv[i], sizeof(MV), depthBytes, m_analysisFile, interPic->mv[i]); } } } @@ -3505,12 +3469,12 @@ { if (!m_param->scaleFactor) { - X265_FREAD(((analysis_intra_data *)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileIn, intraPic->modes); + X265_FREAD(((analysis_intra_data *)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile, intraPic->modes); } else { uint8_t *tempLumaBuf = X265_MALLOC(uint8_t, analysis->numCUsInFrame * scaledNumPartition); - X265_FREAD(tempLumaBuf, sizeof(uint8_t), analysis->numCUsInFrame * scaledNumPartition, m_analysisFileIn, intraPic->modes); + X265_FREAD(tempLumaBuf, sizeof(uint8_t), analysis->numCUsInFrame * scaledNumPartition, m_analysisFile, intraPic->modes); for (uint32_t ctu32Idx = 0, cnt = 0; ctu32Idx < 
analysis->numCUsInFrame * scaledNumPartition; ctu32Idx++, cnt += factor) memset(&((analysis_intra_data *)analysis->intraData)->modes[cnt], tempLumaBuf[ctu32Idx], factor); X265_FREE(tempLumaBuf); @@ -3518,7 +3482,7 @@ } } else - X265_FREAD(((analysis_inter_data *)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFileIn, interPic->ref); + X265_FREAD(((analysis_inter_data *)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFile, interPic->ref); consumedBytes += frameRecordSize; if (numDir == 1) @@ -3793,51 +3757,51 @@ if (!m_param->bUseAnalysisFile) return; - X265_FWRITE(&analysis->frameRecordSize, sizeof(uint32_t), 1, m_analysisFileOut); - X265_FWRITE(&depthBytes, sizeof(uint32_t), 1, m_analysisFileOut); - X265_FWRITE(&analysis->poc, sizeof(int), 1, m_analysisFileOut); - X265_FWRITE(&analysis->sliceType, sizeof(int), 1, m_analysisFileOut); - X265_FWRITE(&analysis->bScenecut, sizeof(int), 1, m_analysisFileOut); - X265_FWRITE(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileOut); - X265_FWRITE(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFileOut); - X265_FWRITE(&analysis->numPartitions, sizeof(int), 1, m_analysisFileOut); + X265_FWRITE(&analysis->frameRecordSize, sizeof(uint32_t), 1, m_analysisFile); + X265_FWRITE(&depthBytes, sizeof(uint32_t), 1, m_analysisFile); + X265_FWRITE(&analysis->poc, sizeof(int), 1, m_analysisFile); + X265_FWRITE(&analysis->sliceType, sizeof(int), 1, m_analysisFile); + X265_FWRITE(&analysis->bScenecut, sizeof(int), 1, m_analysisFile); + X265_FWRITE(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFile); + X265_FWRITE(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFile); + X265_FWRITE(&analysis->numPartitions, sizeof(int), 1, m_analysisFile); if (analysis->sliceType > X265_TYPE_I) - X265_FWRITE((WeightParam*)analysis->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFileOut); + 
X265_FWRITE((WeightParam*)analysis->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFile); if (m_param->analysisReuseLevel < 2) return; if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) { - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->depth, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->partSizes, sizeof(char), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileOut); + X265_FWRITE(((analysis_intra_data*)analysis->intraData)->depth, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_intra_data*)analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_intra_data*)analysis->intraData)->partSizes, sizeof(char), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_intra_data*)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); } else { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->depth, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->modes, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->depth, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->modes, sizeof(uint8_t), depthBytes, m_analysisFile); if (m_param->analysisReuseLevel > 4) { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->partSize, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->mergeFlag, sizeof(uint8_t), depthBytes, m_analysisFileOut); + 
X265_FWRITE(((analysis_inter_data*)analysis->interData)->partSize, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->mergeFlag, sizeof(uint8_t), depthBytes, m_analysisFile); if (m_param->analysisReuseLevel == 10) { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->interDir, sizeof(uint8_t), depthBytes, m_analysisFileOut); - if (bIntraInInter) X265_FWRITE(((analysis_intra_data*)analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->interDir, sizeof(uint8_t), depthBytes, m_analysisFile); + if (bIntraInInter) X265_FWRITE(((analysis_intra_data*)analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFile); for (uint32_t dir = 0; dir < numDir; dir++) { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->mvpIdx[dir], sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->refIdx[dir], sizeof(int8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->mv[dir], sizeof(MV), depthBytes, m_analysisFileOut); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->mvpIdx[dir], sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->refIdx[dir], sizeof(int8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->mv[dir], sizeof(MV), depthBytes, m_analysisFile); } if (bIntraInInter) - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileOut); + X265_FWRITE(((analysis_intra_data*)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); } } if (m_param->analysisReuseLevel != 10) - X265_FWRITE(((analysis_inter_data*)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * 
X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFileOut); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFile); } #undef X265_FWRITE
x265_2.7.tar.gz/source/encoder/encoder.h -> x265_2.6.tar.gz/source/encoder/encoder.h
Changed
@@ -130,6 +130,7 @@
     FrameEncoder* m_frameEncoder[X265_MAX_FRAME_THREADS];
     DPB* m_dpb;
     Frame* m_exportedPic;
+    FILE* m_analysisFile;
     FILE* m_analysisFileIn;
     FILE* m_analysisFileOut;
     x265_param* m_param;
@@ -207,7 +208,7 @@
     int copySlicetypePocAndSceneCut(int *slicetype, int *poc, int *sceneCut);
-    int getRefFrameList(PicYuv** l0, PicYuv** l1, int sliceType, int poc, int* pocL0, int* pocL1);
+    int getRefFrameList(PicYuv** l0, PicYuv** l1, int sliceType, int poc);
     int setAnalysisDataAfterZScan(x265_analysis_data *analysis_data, Frame* curFrame);
x265_2.7.tar.gz/source/encoder/frameencoder.cpp -> x265_2.6.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -335,13 +335,15 @@ while (!m_frame->m_ctuInfo) m_frame->m_copied.wait(); } - if ((m_param->bMVType == AVC_INFO) && !m_param->analysisSave && !m_param->analysisLoad && !(IS_X265_TYPE_I(m_frame->m_lowres.sliceType))) + if ((m_param->bMVType == AVC_INFO) && !m_param->analysisReuseMode && !(IS_X265_TYPE_I(m_frame->m_lowres.sliceType))) { while (((m_frame->m_analysisData.interData == NULL && m_frame->m_analysisData.intraData == NULL) || (uint32_t)m_frame->m_poc != m_frame->m_analysisData.poc)) m_frame->m_copyMVType.wait(); } compressFrame(); m_done.trigger(); /* FrameEncoder::getEncodedPicture() blocks for this event */ + if (m_frame != NULL) + m_frame->m_reconEncoded.trigger(); m_enable.wait(); } } @@ -430,7 +432,7 @@ bool bUseWeightB = slice->m_sliceType == B_SLICE && slice->m_pps->bUseWeightedBiPred; WeightParam* reuseWP = NULL; - if (m_param->analysisLoad && (bUseWeightP || bUseWeightB)) + if (m_param->analysisReuseMode && (bUseWeightP || bUseWeightB)) reuseWP = (WeightParam*)m_frame->m_analysisData.wt; if (bUseWeightP || bUseWeightB) @@ -439,7 +441,7 @@ m_cuStats.countWeightAnalyze++; ScopedElapsedTime time(m_cuStats.weightAnalyzeTime); #endif - if (m_param->analysisLoad) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD) { for (int list = 0; list < slice->isInterB() + 1; list++) { @@ -466,8 +468,6 @@ else slice->disableWeights(); - if (m_param->analysisSave && (bUseWeightP || bUseWeightB)) - reuseWP = (WeightParam*)m_frame->m_analysisData.wt; // Generate motion references int numPredDir = slice->isInterP() ? 1 : slice->isInterB() ? 2 : 0; for (int l = 0; l < numPredDir; l++) @@ -480,7 +480,7 @@ slice->m_refReconPicList[l][ref] = slice->m_refFrameList[l][ref]->m_reconPic; m_mref[l][ref].init(slice->m_refReconPicList[l][ref], w, *m_param); } - if (m_param->analysisSave && (bUseWeightP || bUseWeightB)) + if (m_param->analysisReuseMode == X265_ANALYSIS_SAVE && (bUseWeightP || bUseWeightB)) { for (int i = 0; i < (m_param->internalCsp != X265_CSP_I400 ? 
3 : 1); i++) *(reuseWP++) = slice->m_weightPredTable[l][0][i]; @@ -1413,7 +1413,7 @@ /* TODO: use defines from slicetype.h for lowres block size */ uint32_t block_y = (ctu->m_cuPelY >> m_param->maxLog2CUSize) * noOfBlocks; uint32_t block_x = (ctu->m_cuPelX >> m_param->maxLog2CUSize) * noOfBlocks; - if (!m_param->analysisLoad || !m_param->bDisableLookahead) + if (m_param->analysisReuseMode != X265_ANALYSIS_LOAD || !m_param->bDisableLookahead) { cuStat.vbvCost = 0; cuStat.intraVbvCost = 0; @@ -1748,8 +1748,8 @@ if (rowInSlice == rowCount) { m_rowSliceTotalBits[sliceId] = 0; - if (bIsVbv && !(m_param->rc.bEnableConstVbv && m_param->bEnableWavefront)) - { + if (bIsVbv) + { for (uint32_t i = m_sliceBaseRow[sliceId]; i < rowCount + m_sliceBaseRow[sliceId]; i++) m_rowSliceTotalBits[sliceId] += curEncData.m_rowStat[i].encodedBits; }
x265_2.7.tar.gz/source/encoder/framefilter.cpp -> x265_2.6.tar.gz/source/encoder/framefilter.cpp
Changed
@@ -795,7 +795,7 @@
 void FrameFilter::computeMEIntegral(int row)
 {
     int lastRow = row == (int)m_frame->m_encData->m_slice->m_sps->numCuInHeight - 1;
-    if (m_frame->m_lowres.sliceType != X265_TYPE_B)
+    if (m_frame->m_encData->m_meIntegral && m_frame->m_lowres.sliceType != X265_TYPE_B)
     {
         /* If WPP, other than first row, integral calculation for current row needs to wait till the
          * integral for the previous row is computed */
x265_2.7.tar.gz/source/encoder/ratecontrol.cpp -> x265_2.6.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -219,7 +219,6 @@
     m_param->rc.vbvMaxBitrate = x265_clip3(0, 2000000, m_param->rc.vbvMaxBitrate);
     m_param->rc.vbvBufferInit = x265_clip3(0.0, 2000000.0, m_param->rc.vbvBufferInit);
     m_param->vbvBufferEnd = x265_clip3(0.0, 2000000.0, m_param->vbvBufferEnd);
-    m_initVbv = false;
     m_singleFrameVbv = 0;
     m_rateTolerance = 1.0;
@@ -320,7 +319,7 @@
 bool RateControl::init(const SPS& sps)
 {
-    if (m_isVbv && !m_initVbv)
+    if (m_isVbv)
     {
         /* We don't support changing the ABR bitrate right now,
          * so if the stream starts as CBR, keep it CBR. */
@@ -354,7 +353,6 @@
         m_bufferFillFinal = m_bufferSize * m_param->rc.vbvBufferInit;
         m_bufferFillActual = m_bufferFillFinal;
         m_bufferExcess = 0;
-        m_initVbv = true;
     }
     m_totalBits = 0;
x265_2.7.tar.gz/source/encoder/ratecontrol.h -> x265_2.6.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -132,7 +132,6 @@
     bool m_isGrainEnabled;
     bool m_isAbrReset;
     bool m_isNextGop;
-    bool m_initVbv;
     int m_lastAbrResetPoc;
     double m_rateTolerance;
x265_2.7.tar.gz/source/encoder/sao.h -> x265_2.6.tar.gz/source/encoder/sao.h
Changed
@@ -55,9 +55,12 @@
     enum { NUM_EDGETYPE = 5 };
     enum { NUM_PLANE = 3 };
     enum { SAO_DEPTHRATE_SIZE = 4 };
+
     static const uint32_t s_eoTable[NUM_EDGETYPE];
-    typedef int32_t PerClass[MAX_NUM_SAO_TYPE][MAX_NUM_SAO_CLASS];
-    typedef int32_t PerPlane[NUM_PLANE][MAX_NUM_SAO_TYPE][MAX_NUM_SAO_CLASS];
+
+    typedef int32_t (PerClass[MAX_NUM_SAO_TYPE][MAX_NUM_SAO_CLASS]);
+    typedef int32_t (PerPlane[NUM_PLANE][MAX_NUM_SAO_TYPE][MAX_NUM_SAO_CLASS]);
+
 protected:
     /* allocated per part */
x265_2.7.tar.gz/source/encoder/search.cpp -> x265_2.6.tar.gz/source/encoder/search.cpp
Changed
@@ -1947,7 +1947,7 @@
         /* poc difference is out of range for lookahead */
         return 0;
-    MV* mvs = m_frame->m_lowres.lowresMvs[list][diffPoc];
+    MV* mvs = m_frame->m_lowres.lowresMvs[list][diffPoc - 1];
     if (mvs[0].x == 0x7FFF)
         /* this motion search was not estimated by lookahead */
         return 0;
@@ -2073,7 +2073,7 @@
     int mvpIdx = selectMVP(interMode.cu, pu, amvp, list, ref);
     MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];
-    if (!m_param->analysisSave && !m_param->analysisLoad) /* Prevents load/save outputs from diverging if lowresMV is not available */
+    if (!m_param->analysisReuseMode) /* Prevents load/save outputs from diverging if lowresMV is not available */
     {
         MV lmv = getLowresMV(interMode.cu, pu, list, ref);
         if (lmv.notZero())
@@ -2161,7 +2161,7 @@
     cu.getNeighbourMV(puIdx, pu.puAbsPartIdx, interMode.interNeighbours);
     /* Uni-directional prediction */
-    if ((m_param->analysisLoad && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10)
+    if ((m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10)
         || (m_param->analysisMultiPassRefine && m_param->rc.bStatRead) || (m_param->bMVType == AVC_INFO))
     {
         for (int list = 0; list < numPredDir; list++)
@@ -2297,7 +2297,7 @@
     int mvpIdx = selectMVP(cu, pu, amvp, list, ref);
     MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];
-    if (!m_param->analysisSave && !m_param->analysisLoad) /* Prevents load/save outputs from diverging when lowresMV is not available */
+    if (!m_param->analysisReuseMode) /* Prevents load/save outputs from diverging when lowresMV is not available */
     {
         MV lmv = getLowresMV(cu, pu, list, ref);
         if (lmv.notZero())
x265_2.7.tar.gz/source/encoder/slicetype.cpp -> x265_2.6.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -154,7 +154,7 @@ int blockXY = 0; int blockX = 0, blockY = 0; double strength = 0.f; - if ((param->rc.aqMode == X265_AQ_NONE || param->rc.aqStrength == 0) || (param->rc.bStatRead && param->rc.cuTree && IS_REFERENCED(curFrame))) + if (param->rc.aqMode == X265_AQ_NONE || param->rc.aqStrength == 0) { /* Need to init it anyways for CU tree */ int cuCount = blockCount; @@ -589,7 +589,7 @@ m_outputSignalRequired = false; m_isActive = true; m_inputCount = 0; - m_extendGopBoundary = false; + m_8x8Height = ((m_param->sourceHeight / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; m_8x8Width = ((m_param->sourceWidth / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; m_cuCount = m_8x8Width * m_8x8Height; @@ -646,11 +646,7 @@ m_numRowsPerSlice = m_8x8Height; m_numCoopSlices = 1; } - if (param->gopLookahead && (param->gopLookahead > (param->lookaheadDepth - param->bframes - 2))) - { - param->gopLookahead = X265_MAX(0, param->lookaheadDepth - param->bframes - 2); - x265_log(param, X265_LOG_WARNING, "Gop-lookahead cannot be greater than (rc-lookahead - length of the mini-gop); Clipping gop-lookahead to %d\n", param->gopLookahead); - } + #if DETAILED_CU_STATS m_slicetypeDecideElapsedTime = 0; m_preLookaheadElapsedTime = 0; @@ -746,7 +742,7 @@ /* Called by API thread */ void Lookahead::addPicture(Frame& curFrame, int sliceType) { - if (m_param->analysisLoad && m_param->bDisableLookahead) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->bDisableLookahead) { if (!m_filled) m_filled = true; @@ -847,7 +843,7 @@ return out; } - if (m_param->analysisLoad && m_param->bDisableLookahead) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->bDisableLookahead) return NULL; findJob(-1); /* run slicetypeDecide() if necessary */ @@ -879,7 +875,7 @@ Slice *slice = curFrame->m_encData->m_slice; int p0 = 0, p1, b; int poc = slice->m_poc; - int l0poc = slice->m_rps.numberOfNegativePictures ? 
slice->m_refPOCList[0][0] : -1; + int l0poc = slice->m_refPOCList[0][0]; int l1poc = slice->m_refPOCList[1][0]; switch (slice->m_sliceType) @@ -896,34 +892,23 @@ break; case B_SLICE: - if (l0poc >= 0) - { - b = poc - l0poc; - p1 = b + l1poc - poc; - frames[p0] = &slice->m_refFrameList[0][0]->m_lowres; - frames[b] = &curFrame->m_lowres; - frames[p1] = &slice->m_refFrameList[1][0]->m_lowres; - } - else - { - p0 = b = 0; - p1 = b + l1poc - poc; - frames[p0] = frames[b] = &curFrame->m_lowres; - frames[p1] = &slice->m_refFrameList[1][0]->m_lowres; - } - + b = poc - l0poc; + p1 = b + l1poc - poc; + frames[p0] = &slice->m_refFrameList[0][0]->m_lowres; + frames[b] = &curFrame->m_lowres; + frames[p1] = &slice->m_refFrameList[1][0]->m_lowres; break; default: return; } - if (!m_param->analysisLoad || !m_param->bDisableLookahead) + if (m_param->analysisReuseMode != X265_ANALYSIS_LOAD || !m_param->bDisableLookahead) { X265_CHECK(curFrame->m_lowres.costEst[b - p0][p1 - b] > 0, "Slice cost not estimated\n") if (m_param->rc.cuTree && !m_param->rc.bStatRead) /* update row satds based on cutree offsets */ curFrame->m_lowres.satdCost = frameCostRecalculate(frames, p0, p1, b); - else if (!m_param->analysisLoad || m_param->scaleFactor) + else if (m_param->analysisReuseMode != X265_ANALYSIS_LOAD || m_param->scaleFactor) { if (m_param->rc.aqMode) curFrame->m_lowres.satdCost = curFrame->m_lowres.costEstAq[b - p0][p1 - b]; @@ -997,8 +982,11 @@ ProfileLookaheadTime(m_lookahead.m_preLookaheadElapsedTime, m_lookahead.m_countPreLookahead); ProfileScopeEvent(prelookahead); m_lock.release(); + preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc); - if (m_lookahead.m_bAdaptiveQuant) + if (m_lookahead.m_param->rc.bStatRead && m_lookahead.m_param->rc.cuTree && IS_REFERENCED(preFrame)) + /* cu-tree offsets were read from stats file */; + else if (m_lookahead.m_bAdaptiveQuant) tld.calcAdaptiveQuantFrame(preFrame, m_lookahead.m_param); tld.lowresIntraEstimate(preFrame->m_lowres, 
m_lookahead.m_param->rc.qgSize); preFrame->m_lowresInit = true; @@ -1064,7 +1052,7 @@ { slicetypeAnalyse(frames, false); bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; - if (m_param->analysisLoad && m_param->scaleFactor && bIsVbv) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->scaleFactor && bIsVbv) { int numFrames; for (numFrames = 0; numFrames < maxSearch; numFrames++) @@ -1098,8 +1086,7 @@ x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid and %d reference frames\n", frm.sliceType, m_param->maxNumReferences); } - if ((!m_param->bIntraRefresh || frm.frameNum == 0) && frm.frameNum - m_lastKeyframe >= m_param->keyframeMax && - (!m_extendGopBoundary || frm.frameNum - m_lastKeyframe >= m_param->keyframeMax + m_param->gopLookahead)) + if ((!m_param->bIntraRefresh || frm.frameNum == 0) && frm.frameNum - m_lastKeyframe >= m_param->keyframeMax) { if (frm.sliceType == X265_TYPE_AUTO || frm.sliceType == X265_TYPE_I) frm.sliceType = m_param->bOpenGOP && m_lastKeyframe >= 0 ? X265_TYPE_I : X265_TYPE_IDR; @@ -1128,20 +1115,12 @@ /* Closed GOP */ m_lastKeyframe = frm.frameNum; frm.bKeyframe = true; - if (bframes > 0 && !m_param->radl) + if (bframes > 0) { list[bframes - 1]->m_lowres.sliceType = X265_TYPE_P; bframes--; } } - if (m_param->radl && !m_param->bOpenGOP && list[bframes + 1]) - { - if ((frm.frameNum - m_lastKeyframe) > (m_param->keyframeMax - m_param->radl - 1) && (frm.frameNum - m_lastKeyframe) < m_param->keyframeMax) - frm.sliceType = X265_TYPE_B; - if ((frm.frameNum - m_lastKeyframe) == (m_param->keyframeMax - m_param->radl - 1)) - frm.sliceType = X265_TYPE_P; - } - if (bframes == m_param->bframes || !list[bframes + 1]) { if (IS_X265_TYPE_B(frm.sliceType)) @@ -1191,13 +1170,8 @@ if (bframes) { p0 = 0; // last nonb - bool isp0available = frames[bframes + 1]->sliceType == X265_TYPE_IDR ? 
false : true; - for (b = 1; b <= bframes; b++) { - if (!isp0available) - p0 = b; - if (frames[b]->sliceType == X265_TYPE_B) for (p1 = b; frames[p1]->sliceType == X265_TYPE_B; p1++) ; // find new nonb or bref @@ -1207,10 +1181,7 @@ estGroup.singleCost(p0, p1, b); if (frames[b]->sliceType == X265_TYPE_BREF) - { p0 = b; - isp0available = true; - } } } } @@ -1234,8 +1205,9 @@ int idx = 0; list[bframes]->m_reorderedPts = pts[idx++]; m_outputQueue.pushBack(*list[bframes]); + /* Add B-ref frame next to P frame in output queue, the B-ref encode before non B-ref frame */ - if (brefs) + if (bframes > 1 && m_param->bBPyramid) { for (int i = 0; i < bframes; i++) { @@ -1275,7 +1247,7 @@ frames[j + 1] = NULL; slicetypeAnalyse(frames, true); bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; - if (m_param->analysisLoad && m_param->scaleFactor && bIsVbv) + if (m_param->analysisReuseMode == X265_ANALYSIS_LOAD && m_param->scaleFactor && bIsVbv) { int numFrames; for (numFrames = 0; numFrames < maxSearch; numFrames++) @@ -1405,14 +1377,12 @@ cuTree(frames, 0, bKeyframe); return; } + frames[framecnt + 1] = NULL; - int keyFrameLimit = m_param->keyframeMax + m_lastKeyframe - frames[0]->frameNum - 1; - if (m_param->gopLookahead && keyFrameLimit <= m_param->bframes + 1) - keyintLimit = keyFrameLimit + m_param->gopLookahead; - else - keyintLimit = keyFrameLimit; + keyintLimit = m_param->keyframeMax - frames[0]->frameNum + m_lastKeyframe - 1; origNumFrames = numFrames = m_param->bIntraRefresh ? 
framecnt : X265_MIN(framecnt, keyintLimit); + if (bIsVbvLookahead) numFrames = framecnt; else if (m_param->bOpenGOP && numFrames < framecnt) @@ -1436,12 +1406,12 @@ continue; /* Skip search if already done */ - if (frames[b]->lowresMvs[0][i][0].x != 0x7FFF) + if (frames[b]->lowresMvs[0][i - 1][0].x != 0x7FFF) continue; /* perform search to p1 at same distance, if possible */ int p1 = b + i; - if (p1 >= numFrames || frames[b]->lowresMvs[1][i][0].x != 0x7FFF) + if (p1 >= numFrames || frames[b]->lowresMvs[1][i - 1][0].x != 0x7FFF) p1 = b; estGroup.add(p0, p1, b); @@ -1463,7 +1433,7 @@ /* only measure frame cost in this pass if motion searches * are already done */ - if (frames[b]->lowresMvs[0][i][0].x == 0x7FFF) + if (frames[b]->lowresMvs[0][i - 1][0].x == 0x7FFF) continue; int p0 = b - i; @@ -1475,7 +1445,7 @@ break; /* ensure P1 search is done */ - if (j && frames[b]->lowresMvs[1][j][0].x == 0x7FFF) + if (j && frames[b]->lowresMvs[1][j - 1][0].x == 0x7FFF) continue; /* ensure frame cost is not done */ @@ -1502,26 +1472,7 @@ frames[1]->sliceType = X265_TYPE_I; return; } - if (m_param->gopLookahead && (keyFrameLimit >= 0) && (keyFrameLimit <= m_param->bframes + 1)) - { - bool sceneTransition = m_isSceneTransition; - m_extendGopBoundary = false; - for (int i = m_param->bframes + 1; i < origNumFrames; i += m_param->bframes + 1) - { - scenecut(frames, i, i + 1, true, origNumFrames); - for (int j = i + 1; j <= X265_MIN(i + m_param->bframes + 1, origNumFrames); j++) - { - if (frames[j]->bScenecut && scenecutInternal(frames, j - 1, j, true) ) - { - m_extendGopBoundary = true; - break; - } - } - if (m_extendGopBoundary) - break; - } - m_isSceneTransition = sceneTransition; - } + if (m_param->bframes) { if (m_param->bFrameAdaptive == X265_B_ADAPT_TRELLIS) @@ -1627,8 +1578,6 @@ if (m_param->rc.cuTree) cuTree(frames, X265_MIN(numFrames, m_param->keyframeMax), bKeyframe); - if (m_param->gopLookahead && (keyFrameLimit >= 0) && (keyFrameLimit <= m_param->bframes + 1) && 
!m_extendGopBoundary) - keyintLimit = keyFrameLimit; if (!m_param->bIntraRefresh) for (int j = keyintLimit + 1; j <= numFrames; j += m_param->keyframeMax) @@ -1639,8 +1588,8 @@ if (bIsVbvLookahead) vbvLookahead(frames, numFrames, bKeyframe); - int maxp1 = X265_MIN(m_param->bframes + 1, origNumFrames); + int maxp1 = X265_MIN(m_param->bframes + 1, origNumFrames); /* Restore frame types for all frames that haven't actually been decided yet. */ for (int j = resetStart; j <= numFrames; j++) { @@ -1664,8 +1613,8 @@ bool fluctuate = false; bool noScenecuts = false; int64_t avgSatdCost = 0; - if (frames[p0]->costEst[p1 - p0][0] > -1) - avgSatdCost = frames[p0]->costEst[p1 - p0][0]; + if (frames[0]->costEst[1][0] > -1) + avgSatdCost = frames[0]->costEst[1][0]; int cnt = 1; /* Where A and B are scenes: AAAAAABBBAAAAAA * If BBB is shorter than (maxp1-p0), it is detected as a flash @@ -1751,10 +1700,12 @@ CostEstimateGroup estGroup(*this, frames); estGroup.singleCost(p0, p1, p1); + int64_t icost = frame->costEst[0][0]; int64_t pcost = frame->costEst[p1 - p0][0]; - int gopSize = (frame->frameNum - m_lastKeyframe) % m_param->keyframeMax; + int gopSize = frame->frameNum - m_lastKeyframe; float threshMax = (float)(m_param->scenecutThreshold / 100.0); + /* magic numbers pulled out of thin air */ float threshMin = (float)(threshMax * 0.25); double bias = m_param->scenecutBias; @@ -1890,7 +1841,7 @@ void Lookahead::calcMotionAdaptiveQuantFrame(Lowres **frames, int p0, int p1, int b) { - int listDist[2] = { b - p0, p1 - b }; + int listDist[2] = { b - p0 - 1, p1 - b - 1 }; int32_t strideInCU = m_8x8Width; double qp_adj = 0, avg_adj = 0, avg_adj_pow2 = 0, sd; for (uint16_t blocky = 0; blocky < m_8x8Height; blocky++) @@ -2053,7 +2004,7 @@ int32_t distScaleFactor = (((b - p0) << 8) + ((p1 - p0) >> 1)) / (p1 - p0); int32_t bipredWeight = m_param->bEnableWeightedBiPred ? 
64 - (distScaleFactor >> 2) : 32; int32_t bipredWeights[2] = { bipredWeight, 64 - bipredWeight }; - int listDist[2] = { b - p0, p1 - b }; + int listDist[2] = { b - p0 - 1, p1 - b - 1 }; memset(m_scratch, 0, m_8x8Width * sizeof(int)); @@ -2328,15 +2279,17 @@ score = fenc->costEst[b - p0][p1 - b]; else { + X265_CHECK(p0 != b, "I frame estimates should always be pre-calculated\n"); + bool bDoSearch[2]; - bDoSearch[0] = fenc->lowresMvs[0][b - p0][0].x == 0x7FFF; - bDoSearch[1] = p1 > b && fenc->lowresMvs[1][p1 - b][0].x == 0x7FFF; + bDoSearch[0] = p0 < b && fenc->lowresMvs[0][b - p0 - 1][0].x == 0x7FFF; + bDoSearch[1] = p1 > b && fenc->lowresMvs[1][p1 - b - 1][0].x == 0x7FFF; #if CHECKED_BUILD - X265_CHECK(!(p0 < b && fenc->lowresMvs[0][b - p0][0].x == 0x7FFE), "motion search batch duplication L0\n"); - X265_CHECK(!(p1 > b && fenc->lowresMvs[1][p1 - b][0].x == 0x7FFE), "motion search batch duplication L1\n"); - if (bDoSearch[0]) fenc->lowresMvs[0][b - p0][0].x = 0x7FFE; - if (bDoSearch[1]) fenc->lowresMvs[1][p1 - b][0].x = 0x7FFE; + X265_CHECK(!(p0 < b && fenc->lowresMvs[0][b - p0 - 1][0].x == 0x7FFE), "motion search batch duplication L0\n"); + X265_CHECK(!(p1 > b && fenc->lowresMvs[1][p1 - b - 1][0].x == 0x7FFE), "motion search batch duplication L1\n"); + if (bDoSearch[0]) fenc->lowresMvs[0][b - p0 - 1][0].x = 0x7FFE; + if (bDoSearch[1]) fenc->lowresMvs[1][p1 - b - 1][0].x = 0x7FFE; #endif fenc->weightedRef[b - p0].isWeighted = false; @@ -2427,7 +2380,7 @@ /* A small, arbitrary bias to avoid VBV problems caused by zero-residual lookahead blocks. */ int lowresPenalty = 4; - int listDist[2] = { b - p0, p1 - b}; + int listDist[2] = { b - p0 - 1, p1 - b - 1 }; MV mvmin, mvmax; int bcost = tld.me.COST_MAX;
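The bi-prediction weights that appear as context lines in the slicetype.cpp diff above can be checked with a small worked example. This is only a sketch of that arithmetic: `BipredWeights` and `bipredWeightsFor` are hypothetical names introduced here for illustration and are not part of x265.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration; not x265 API.
struct BipredWeights { int32_t w0; int32_t w1; };

// p0 and p1 are the past and future reference positions, b is the current
// frame, with p0 < b < p1 assumed by the caller.
inline BipredWeights bipredWeightsFor(int p0, int p1, int b, bool weightedBiPred)
{
    // Position of b between p0 and p1, scaled by 256 and rounded to nearest.
    int32_t distScaleFactor = (((b - p0) << 8) + ((p1 - p0) >> 1)) / (p1 - p0);
    // Drop two fractional bits so the pair of weights sums to 64.
    int32_t w = weightedBiPred ? 64 - (distScaleFactor >> 2) : 32;
    return { w, 64 - w };
}
```

For p0 = 0, b = 1, p1 = 3 the scale factor is 257 / 3 = 85, so the weights come out as 43 and 21; a frame exactly halfway between its references gets 32 and 32.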
x265_2.7.tar.gz/source/encoder/slicetype.h -> x265_2.6.tar.gz/source/encoder/slicetype.h
Changed
@@ -132,7 +132,6 @@
     bool m_filled;
     bool m_isSceneTransition;
     int m_numPools;
-    bool m_extendGopBoundary;
     Lookahead(x265_param *param, ThreadPool *pool);
 #if DETAILED_CU_STATS
     int64_t m_slicetypeDecideElapsedTime;
x265_2.7.tar.gz/source/encoder/weightPrediction.cpp -> x265_2.6.tar.gz/source/encoder/weightPrediction.cpp
Changed
@@ -323,7 +323,7 @@
     if (!plane && diffPoc <= param.bframes + 1)
     {
-        mvs = fenc.lowresMvs[list][diffPoc];
+        mvs = fenc.lowresMvs[list][diffPoc - 1];
         /* test whether this motion search was performed by lookahead */
         if (mvs[0].x != 0x7FFF)
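The recurring `[diffPoc]` versus `[diffPoc - 1]` change in this revision is a switch between two ways of indexing a table by reference distance. The sketch below shows the zero-based form used on the 2.6 side, together with the `0x7FFF` "not searched" sentinel seen in the diff. It is a simplified illustration: `DistanceIndexedMvs` is a hypothetical class, and it keeps only the first vector of each slot, whereas the real `lowresMvs` entries point to whole arrays.

```cpp
#include <cstdint>
#include <vector>

struct MV { int16_t x, y; };

// Sentinel value taken from the diff: marks a slot with no motion search yet.
constexpr int16_t kNotSearched = 0x7FFF;

// Hypothetical container: one slot per reference distance 1..maxDist,
// stored zero-based, so distance d lives at index d - 1.
class DistanceIndexedMvs
{
public:
    explicit DistanceIndexedMvs(int maxDist)
        : first_(maxDist, MV{kNotSearched, 0}) {}

    // True once a motion vector has been stored for this distance.
    bool searched(int dist) const { return first_[dist - 1].x != kNotSearched; }

    void store(int dist, MV mv) { first_[dist - 1] = mv; }

private:
    std::vector<MV> first_;
};
```

With this layout a distance of 0 has no slot at all, which would explain why the 2.6 side guards some lookups with `p0 < b` before indexing.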
x265_2.7.tar.gz/source/input/y4m.cpp -> x265_2.6.tar.gz/source/input/y4m.cpp
Changed
@@ -20,8 +20,7 @@ * This program is also available under a commercial proprietary license. * For more information, contact us at license @ x265.com. *****************************************************************************/ -#define _FILE_OFFSET_BITS 64 -#define _LARGEFILE_SOURCE + #include "y4m.h" #include "common.h" @@ -39,7 +38,9 @@ using namespace X265_NS; using namespace std; -static const char header[] = {'F','R','A','M','E'}; + +static const char header[] = "FRAME"; + Y4MInput::Y4MInput(InputFileInfo& info) { for (int i = 0; i < QUEUE_SIZE; i++) @@ -59,14 +60,15 @@ ifs = NULL; if (!strcmp(info.filename, "-")) { - ifs = stdin; + ifs = &cin; #if _WIN32 setmode(fileno(stdin), O_BINARY); #endif } else - ifs = x265_fopen(info.filename, "rb"); - if (ifs && !ferror(ifs) && parseHeader()) + ifs = new ifstream(info.filename, ios::binary | ios::in); + + if (ifs && ifs->good() && parseHeader()) { int pixelbytes = depth > 8 ? 2 : 1; for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++) @@ -89,8 +91,8 @@ } if (!threadActive) { - if (ifs && ifs != stdin) - fclose(ifs); + if (ifs && ifs != &cin) + delete ifs; ifs = NULL; return; } @@ -104,34 +106,61 @@ info.csp = colorSpace; info.depth = depth; info.frameCount = -1; - size_t estFrameSize = framesize + sizeof(header) + 1; /* assume basic FRAME\n headers */ + + size_t estFrameSize = framesize + strlen(header) + 1; /* assume basic FRAME\n headers */ + /* try to estimate frame count, if this is not stdin */ - if (ifs != stdin) + if (ifs != &cin) { - int64_t cur = ftello(ifs); + istream::pos_type cur = ifs->tellg(); + +#if defined(_MSC_VER) && _MSC_VER < 1700 + /* Older MSVC versions cannot handle 64bit file sizes properly, so go native */ + HANDLE hFile = CreateFileA(info.filename, GENERIC_READ, + FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, + FILE_ATTRIBUTE_NORMAL, NULL); + if (hFile != INVALID_HANDLE_VALUE) + { + LARGE_INTEGER size; + if (GetFileSizeEx(hFile, &size)) + info.frameCount = 
(int)((size.QuadPart - (int64_t)cur) / estFrameSize); + CloseHandle(hFile); + } +#else // if defined(_MSC_VER) && _MSC_VER < 1700 if (cur >= 0) { - fseeko(ifs, 0, SEEK_END); - int64_t size = ftello(ifs); - fseeko(ifs, cur, SEEK_SET); + ifs->seekg(0, ios::end); + istream::pos_type size = ifs->tellg(); + ifs->seekg(cur, ios::beg); if (size > 0) info.frameCount = (int)((size - cur) / estFrameSize); } +#endif // if defined(_MSC_VER) && _MSC_VER < 1700 } + if (info.skipFrames) { - if (ifs != stdin) - fseeko(ifs, (int64_t)estFrameSize * info.skipFrames, SEEK_CUR); +#if X86_64 + if (ifs != &cin) + ifs->seekg((uint64_t)estFrameSize * info.skipFrames, ios::cur); else for (int i = 0; i < info.skipFrames; i++) - if (fread(buf[0], estFrameSize - framesize, 1, ifs) + fread(buf[0], framesize, 1, ifs) != 2) - break; + { + ifs->read(buf[0], estFrameSize - framesize); + ifs->read(buf[0], framesize); + } +#else + for (int i = 0; i < info.skipFrames; i++) + ifs->ignore(estFrameSize); +#endif } } + Y4MInput::~Y4MInput() { - if (ifs && ifs != stdin) - fclose(ifs); + if (ifs && ifs != &cin) + delete ifs; + for (int i = 0; i < QUEUE_SIZE; i++) X265_FREE(buf[i]); } @@ -151,31 +180,37 @@ int csp = 0; int d = 0; - int c; - while ((c = fgetc(ifs)) != EOF) + + while (ifs->good()) { // Skip Y4MPEG string - while ((c != EOF) && (c != ' ') && (c != '\n')) - c = fgetc(ifs); - while (c == ' ') + int c = ifs->get(); + while (ifs->good() && (c != ' ') && (c != '\n')) + c = ifs->get(); + + while (c == ' ' && ifs->good()) { // read parameter identifier - switch (fgetc(ifs)) + switch (ifs->get()) { case 'W': width = 0; - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { + c = ifs->get(); + if (c == ' ' || c == '\n') break; else width = width * 10 + (c - '0'); } break; + case 'H': height = 0; - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { + c = ifs->get(); if (c == ' ' || c == '\n') break; else @@ -186,13 +221,15 @@ case 'F': rateNum = 0; rateDenom = 0; - while ((c = fgetc(ifs)) != EOF) 
+ while (ifs->good()) { + c = ifs->get(); if (c == '.') { rateDenom = 1; - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { + c = ifs->get(); if (c == ' ' || c == '\n') break; else @@ -205,8 +242,9 @@ } else if (c == ':') { - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { + c = ifs->get(); if (c == ' ' || c == '\n') break; else @@ -222,12 +260,14 @@ case 'A': sarWidth = 0; sarHeight = 0; - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { + c = ifs->get(); if (c == ':') { - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { + c = ifs->get(); if (c == ' ' || c == '\n') break; else @@ -243,15 +283,19 @@ case 'C': csp = 0; d = 0; - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { + c = ifs->get(); + if (c <= 'o' && c >= '0') csp = csp * 10 + (c - '0'); else if (c == 'p') { // example: C420p16 - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { + c = ifs->get(); + if (c <= '9' && c >= '0') d = d * 10 + (c - '0'); else @@ -284,10 +328,12 @@ if (d >= 8 && d <= 16) depth = d; break; + default: - while ((c = fgetc(ifs)) != EOF) + while (ifs->good()) { // consume this unsupported configuration word + c = ifs->get(); if (c == ' ' || c == '\n') break; } @@ -329,23 +375,30 @@ threadActive = false; writeCount.poke(); } + bool Y4MInput::populateFrameQueue() { - if (!ifs || ferror(ifs)) + if (!ifs || ifs->fail()) return false; - /* strip off the FRAME\n header */ - char hbuf[sizeof(header) + 1]; - if (fread(hbuf, sizeof(hbuf), 1, ifs) != 1 || memcmp(hbuf, header, sizeof(header))) + + /* strip off the FRAME header */ + char hbuf[sizeof(header)]; + + ifs->read(hbuf, strlen(header)); + if (ifs->eof()) + return false; + + if (!ifs->good() || memcmp(hbuf, header, strlen(header))) { - if (!feof(ifs)) - x265_log(NULL, X265_LOG_ERROR, "y4m: frame header missing\n"); + x265_log(NULL, X265_LOG_ERROR, "y4m: frame header missing\n"); return false; } + /* consume bytes up to line feed */ - int c = hbuf[sizeof(header)]; - while (c != '\n') - if ((c = 
fgetc(ifs)) == EOF) - break; + int c = ifs->get(); + while (c != '\n' && ifs->good()) + c = ifs->get(); + /* wait for room in the ring buffer */ int written = writeCount.get(); int read = readCount.get(); @@ -355,8 +408,10 @@ if (!threadActive) return false; } + ProfileScopeEvent(frameRead); - if (fread(buf[written % QUEUE_SIZE], framesize, 1, ifs) == 1) + ifs->read(buf[written % QUEUE_SIZE], framesize); + if (ifs->good()) { writeCount.incr(); return true;
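The stream-based header stripping on the 2.6 side of the y4m.cpp diff above can be tried in isolation. This is a minimal sketch under stated assumptions: `skipFrameHeader` is a hypothetical free function, not the `Y4MInput::populateFrameQueue()` member, and unlike the original it reports failure when no line feed is found.

```cpp
#include <cstring>
#include <ios>
#include <istream>
#include <sstream>

// Hypothetical helper: consumes one "FRAME...\n" line and returns true when
// the stream is left at the first payload byte of the frame.
inline bool skipFrameHeader(std::istream& in)
{
    static const char header[] = "FRAME";
    const std::streamsize len = static_cast<std::streamsize>(std::strlen(header));
    char hbuf[sizeof(header)];

    // Read exactly the five tag bytes and compare them.
    in.read(hbuf, len);
    if (!in.good() || std::memcmp(hbuf, header, std::strlen(header)) != 0)
        return false;

    // Consume optional frame parameters up to and including the line feed.
    int c = in.get();
    while (c != '\n' && in.good())
        c = in.get();
    return c == '\n';
}
```

Because it takes any `std::istream`, the helper can be exercised against a `std::istringstream` such as `"FRAME Ip\nABCD"` without touching a file.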
x265_2.7.tar.gz/source/input/y4m.h -> x265_2.6.tar.gz/source/input/y4m.h
Changed
@@ -60,9 +60,13 @@
     ThreadSafeInteger readCount;
     ThreadSafeInteger writeCount;
+
     char* buf[QUEUE_SIZE];
-    FILE *ifs;
+
+    std::istream *ifs;
+
     bool parseHeader();
+
     void threadMain();
     bool populateFrameQueue();
@@ -72,10 +76,15 @@
     Y4MInput(InputFileInfo& info);
     virtual ~Y4MInput();
+
     void release();
-    bool isEof() const { return ifs && feof(ifs); }
-    bool isFail() { return !(ifs && !ferror(ifs) && threadActive); }
+
+    bool isEof() const { return ifs && ifs->eof(); }
+
+    bool isFail() { return !(ifs && !ifs->fail() && threadActive); }
+
     void startReader();
+
     bool readPicture(x265_picture&);
     const char *getName() const { return "y4m"; }
x265_2.7.tar.gz/source/input/yuv.cpp -> x265_2.6.tar.gz/source/input/yuv.cpp
Changed
@@ -20,8 +20,7 @@ * This program is also available under a commercial proprietary license. * For more information, contact us at license @ x265.com. *****************************************************************************/ -#define _FILE_OFFSET_BITS 64 -#define _LARGEFILE_SOURCE + #include "yuv.h" #include "common.h" @@ -66,21 +65,23 @@ x265_log(NULL, X265_LOG_ERROR, "yuv: width, height, and FPS must be specified\n"); return; } + if (!strcmp(info.filename, "-")) { - ifs = stdin; + ifs = &cin; #if _WIN32 setmode(fileno(stdin), O_BINARY); #endif } else - ifs = x265_fopen(info.filename, "rb"); - if (ifs && !ferror(ifs)) + ifs = new ifstream(info.filename, ios::binary | ios::in); + + if (ifs && ifs->good()) threadActive = true; else { - if (ifs && ifs != stdin) - fclose(ifs); + if (ifs && ifs != &cin) + delete ifs; ifs = NULL; return; } @@ -97,33 +98,55 @@ } info.frameCount = -1; + /* try to estimate frame count, if this is not stdin */ - if (ifs != stdin) + if (ifs != &cin) { - int64_t cur = ftello(ifs); + istream::pos_type cur = ifs->tellg(); + +#if defined(_MSC_VER) && _MSC_VER < 1700 + /* Older MSVC versions cannot handle 64bit file sizes properly, so go native */ + HANDLE hFile = CreateFileA(info.filename, GENERIC_READ, + FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, + FILE_ATTRIBUTE_NORMAL, NULL); + if (hFile != INVALID_HANDLE_VALUE) + { + LARGE_INTEGER size; + if (GetFileSizeEx(hFile, &size)) + info.frameCount = (int)((size.QuadPart - (int64_t)cur) / framesize); + CloseHandle(hFile); + } +#else // if defined(_MSC_VER) && _MSC_VER < 1700 if (cur >= 0) { - fseeko(ifs, 0, SEEK_END); - int64_t size = ftello(ifs); - fseeko(ifs, cur, SEEK_SET); + ifs->seekg(0, ios::end); + istream::pos_type size = ifs->tellg(); + ifs->seekg(cur, ios::beg); if (size > 0) info.frameCount = (int)((size - cur) / framesize); } +#endif // if defined(_MSC_VER) && _MSC_VER < 1700 } + if (info.skipFrames) { - if (ifs != stdin) - fseeko(ifs, (int64_t)framesize * info.skipFrames, 
SEEK_CUR); +#if X86_64 + if (ifs != &cin) + ifs->seekg((uint64_t)framesize * info.skipFrames, ios::cur); else for (int i = 0; i < info.skipFrames; i++) - if (fread(buf[0], framesize, 1, ifs) != 1) - break; + ifs->read(buf[0], framesize); +#else + for (int i = 0; i < info.skipFrames; i++) + ifs->ignore(framesize); +#endif } } + YUVInput::~YUVInput() { - if (ifs && ifs != stdin) - fclose(ifs); + if (ifs && ifs != &cin) + delete ifs; for (int i = 0; i < QUEUE_SIZE; i++) X265_FREE(buf[i]); } @@ -156,10 +179,12 @@ threadActive = false; writeCount.poke(); } + bool YUVInput::populateFrameQueue() { - if (!ifs || ferror(ifs)) + if (!ifs || ifs->fail()) return false; + /* wait for room in the ring buffer */ int written = writeCount.get(); int read = readCount.get(); @@ -170,8 +195,10 @@ // release() has been called return false; } + ProfileScopeEvent(frameRead); - if (fread(buf[written % QUEUE_SIZE], framesize, 1, ifs) == 1) + ifs->read(buf[written % QUEUE_SIZE], framesize); + if (ifs->good()) { writeCount.incr(); return true;
x265_2.7.tar.gz/source/input/yuv.h -> x265_2.6.tar.gz/source/input/yuv.h
Changed
@@ -52,9 +52,13 @@ ThreadSafeInteger readCount; ThreadSafeInteger writeCount; + char* buf[QUEUE_SIZE]; - FILE *ifs; + + std::istream *ifs; + int guessFrameCount(); + void threadMain(); bool populateFrameQueue(); @@ -64,9 +68,13 @@ YUVInput(InputFileInfo& info); virtual ~YUVInput(); + void release(); - bool isEof() const { return ifs && feof(ifs); } - bool isFail() { return !(ifs && !ferror(ifs) && threadActive); } + + bool isEof() const { return ifs && ifs->eof(); } + + bool isFail() { return !(ifs && !ifs->fail() && threadActive); } + void startReader(); bool readPicture(x265_picture&);
x265_2.7.tar.gz/source/output/raw.cpp -> x265_2.6.tar.gz/source/output/raw.cpp
Changed
@@ -21,26 +21,18 @@ * This program is also available under a commercial proprietary license. * For more information, contact us at license @ x265.com. *****************************************************************************/ + #include "raw.h" -#if _WIN32 -#include <io.h> -#include <fcntl.h> -#if defined(_MSC_VER) -#pragma warning(disable: 4996) // POSIX setmode and fileno deprecated -#endif -#endif using namespace X265_NS; using namespace std; + RAWOutput::RAWOutput(const char* fname, InputFileInfo&) { b_fail = false; if (!strcmp(fname, "-")) { ofs = stdout; -#if _WIN32 - setmode(fileno(stdout), O_BINARY); -#endif return; } ofs = x265_fopen(fname, "wb");
x265_2.7.tar.gz/source/test/CMakeLists.txt -> x265_2.6.tar.gz/source/test/CMakeLists.txt
Changed
@@ -7,37 +7,37 @@ # add X86 assembly files if(X86) -enable_language(ASM_NASM) +enable_language(ASM_YASM) if(MSVC_IDE) - set(NASM_SRC checkasm-a.obj) + set(YASM_SRC checkasm-a.obj) add_custom_command( OUTPUT checkasm-a.obj - COMMAND ${NASM_EXECUTABLE} - ARGS ${NASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-a.asm -o checkasm-a.obj + COMMAND ${YASM_EXECUTABLE} + ARGS ${YASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-a.asm -o checkasm-a.obj DEPENDS checkasm-a.asm) else() - set(NASM_SRC checkasm-a.asm) + set(YASM_SRC checkasm-a.asm) endif() endif(X86) # add ARM assembly files if(ARM OR CROSS_COMPILE_ARM) enable_language(ASM) - set(NASM_SRC checkasm-arm.S) + set(YASM_SRC checkasm-arm.S) add_custom_command( OUTPUT checkasm-arm.obj COMMAND ${CMAKE_CXX_COMPILER} - ARGS ${NASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-arm.S -o checkasm-arm.obj + ARGS ${YASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-arm.S -o checkasm-arm.obj DEPENDS checkasm-arm.S) endif(ARM OR CROSS_COMPILE_ARM) # add PowerPC assembly files if(POWER) - set(NASM_SRC) + set(YASM_SRC) endif(POWER) -add_executable(TestBench ${NASM_SRC} +add_executable(TestBench ${YASM_SRC} testbench.cpp testharness.h pixelharness.cpp pixelharness.h mbdstharness.cpp mbdstharness.h
x265_2.7.tar.gz/source/test/checkasm-a.asm -> x265_2.6.tar.gz/source/test/checkasm-a.asm
Changed
@@ -26,7 +26,7 @@ ;* For more information, contact us at license @ x265.com. ;***************************************************************************** -%include "x86inc.asm" +%include "../common/x86/x86inc.asm" SECTION_RODATA @@ -35,24 +35,24 @@ %if ARCH_X86_64 ; just random numbers to reduce the chance of incidental match ALIGN 16 -x6: dq 0x1a1b2550a612b48c,0x79445c159ce79064 -x7: dq 0x2eed899d5a28ddcd,0x86b2536fcd8cf636 -x8: dq 0xb0856806085e7943,0x3f2bf84fc0fcca4e -x9: dq 0xacbd382dcf5b8de2,0xd229e1f5b281303f -x10: dq 0x71aeaff20b095fd9,0xab63e2e11fa38ed9 -x11: dq 0x89b0c0765892729a,0x77d410d5c42c882d -x12: dq 0xc45ea11a955d8dd5,0x24b3c1d2a024048b -x13: dq 0x2e8ec680de14b47c,0xdd7b8919edd42786 -x14: dq 0x135ce6888fa02cbf,0x11e53e2b2ac655ef -x15: dq 0x011ff554472a7a10,0x6de8f4c914c334d5 -n7: dq 0x21f86d66c8ca00ce -n8: dq 0x75b6ba21077c48ad -n9: dq 0xed56bb2dcb3c7736 -n10: dq 0x8bda43d3fd1a7e06 -n11: dq 0xb64a9c9e5d318408 -n12: dq 0xdf9a54b303f1d3a3 -n13: dq 0x4a75479abd64e097 -n14: dq 0x249214109d5d1c88 +x6: ddq 0x79445c159ce790641a1b2550a612b48c +x7: ddq 0x86b2536fcd8cf6362eed899d5a28ddcd +x8: ddq 0x3f2bf84fc0fcca4eb0856806085e7943 +x9: ddq 0xd229e1f5b281303facbd382dcf5b8de2 +x10: ddq 0xab63e2e11fa38ed971aeaff20b095fd9 +x11: ddq 0x77d410d5c42c882d89b0c0765892729a +x12: ddq 0x24b3c1d2a024048bc45ea11a955d8dd5 +x13: ddq 0xdd7b8919edd427862e8ec680de14b47c +x14: ddq 0x11e53e2b2ac655ef135ce6888fa02cbf +x15: ddq 0x6de8f4c914c334d5011ff554472a7a10 +n7: dq 0x21f86d66c8ca00ce +n8: dq 0x75b6ba21077c48ad +n9: dq 0xed56bb2dcb3c7736 +n10: dq 0x8bda43d3fd1a7e06 +n11: dq 0xb64a9c9e5d318408 +n12: dq 0xdf9a54b303f1d3a3 +n13: dq 0x4a75479abd64e097 +n14: dq 0x249214109d5d1c88 %endif SECTION .text @@ -70,14 +70,14 @@ ;----------------------------------------------------------------------------- cglobal checkasm_stack_clobber, 1,2 ; Clobber the stack with junk below the stack pointer - %define argsize (max_args+6)*8 - SUB rsp, argsize - mov r1, argsize-8 + %define size 
(max_args+6)*8 + SUB rsp, size + mov r1, size-8 .loop: mov [rsp+r1], r0 sub r1, 8 jge .loop - ADD rsp, argsize + ADD rsp, size RET %if WIN64 @@ -156,11 +156,7 @@ mov r9, rax mov r10, rdx lea r0, [error_message] -%if FORMAT_ELF - call puts wrt ..plt -%else call puts -%endif mov r1, [rsp+max_args*8] mov dword [r1], 0 mov rdx, r10
x265_2.7.tar.gz/source/test/regression-tests.txt -> x265_2.6.tar.gz/source/test/regression-tests.txt
Changed
@@ -18,17 +18,17 @@ BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190 --slices 3 BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless --tu-inter-depth 3 --limit-tu 1 BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao -BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 2 --bitrate 7000 --limit-modes::--preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 2 --bitrate 7000 --limit-modes +BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-reuse-mode=save --analysis-reuse-level 2 --bitrate 7000 --limit-modes::--preset medium --no-cutree --analysis-reuse-mode=load --analysis-reuse-level 2 --bitrate 7000 --limit-modes BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16 --limit-refs 1 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0 --limit-tu 4 -BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 10 --bitrate 7000 --limit-tu 0::--preset slower --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 10 --bitrate 7000 --limit-tu 0 +BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-reuse-mode=save --analysis-reuse-level 10 --bitrate 7000 --limit-tu 0::--preset slower --no-cutree --analysis-reuse-mode=load --analysis-reuse-level 10 --bitrate 7000 --limit-tu 0 BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3 --limit-tu 3 -BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-save x265_analysis.dat --bitrate 7000 --tskip-fast --limit-tu 2::--preset veryslow --no-cutree --analysis-load x265_analysis.dat --bitrate 7000 --tskip-fast --limit-tu 2 +BasketballDrive_1920x1080_50.y4m,--preset 
veryslow --no-cutree --analysis-reuse-mode=save --bitrate 7000 --tskip-fast --limit-tu 4::--preset veryslow --no-cutree --analysis-reuse-mode=load --bitrate 7000 --tskip-fast --limit-tu 4 BasketballDrive_1920x1080_50.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit" Coastguard-4k.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit" Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop Coastguard-4k.y4m,--preset superfast --tune grain --pme --aq-strength 2 --merange 190 -Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 1 --bitrate 15000::--preset veryfast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 1 --bitrate 15000 +Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-reuse-mode=save --analysis-reuse-level 1 --bitrate 15000::--preset veryfast --no-cutree --analysis-reuse-mode=load --analysis-reuse-level 1 --bitrate 15000 Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh --slices 2 Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1 CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16 @@ -52,7 +52,7 @@ DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3 --tu-inter-depth 4 --limit-tu 3 -DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset fast --no-cutree --analysis-save x265_analysis.dat --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1::--preset fast --no-cutree --analysis-load x265_analysis.dat --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1 +DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset fast --no-cutree --analysis-reuse-mode=save --bitrate 3000 --early-skip --tu-inter-depth 3 
--limit-tu 1::--preset fast --no-cutree --analysis-reuse-mode=load --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1 FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2 FourPeople_1280x720_60.y4m,--preset veryfast --aq-mode 2 --aq-strength 1.5 --qg-size 8 FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd @@ -69,8 +69,8 @@ KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 --limit-modes --limit-tu 1 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain --limit-refs 2 -NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-save x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000::--preset slow --no-cutree --analysis-load x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000 -News-4k.y4m,--preset ultrafast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 2 --bitrate 15000::--preset ultrafast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 2 --bitrate 15000 +NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-reuse-mode=save --rd 5 --analysis-reuse-level 10 --bitrate 9000::--preset slow --no-cutree --analysis-reuse-mode=load --rd 5 --analysis-reuse-level 10 --bitrate 9000 +News-4k.y4m,--preset ultrafast --no-cutree --analysis-reuse-mode=save --analysis-reuse-level 2 --bitrate 15000::--preset ultrafast --no-cutree --analysis-reuse-mode=load --analysis-reuse-level 2 --bitrate 15000 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0 News-4k.y4m,--preset superfast --slices 4 --aq-mode 0 News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16 @@ -125,7 +125,7 @@ old_town_cross_444_720p50.y4m,--preset superfast --weightp --min-cu 16 --limit-modes old_town_cross_444_720p50.y4m,--preset veryfast --qp 1 --tune ssim old_town_cross_444_720p50.y4m,--preset faster --rd 1 
--tune zero-latency -old_town_cross_444_720p50.y4m,--preset fast --no-cutree --analysis-save pass1_analysis.dat --analysis-reuse-level 1 --bitrate 3000 --early-skip::--preset fast --no-cutree --analysis-load pass1_analysis.dat --analysis-save pass2_analysis.dat --analysis-reuse-level 1 --bitrate 3000 --early-skip::--preset fast --no-cutree --analysis-load pass2_analysis.dat --analysis-reuse-level 1 --bitrate 3000 --early-skip +old_town_cross_444_720p50.y4m,--preset fast --no-cutree --analysis-reuse-mode=save --analysis-reuse-level 1 --bitrate 3000 --early-skip::--preset fast --no-cutree --analysis-reuse-mode=load --analysis-reuse-level 1 --bitrate 3000 --early-skip old_town_cross_444_720p50.y4m,--preset medium --keyint -1 --no-weightp --ref 6 old_town_cross_444_720p50.y4m,--preset slow --rdoq-level 1 --early-skip --ref 7 --no-b-pyramid old_town_cross_444_720p50.y4m,--preset slower --crf 4 --cu-lossless @@ -150,8 +150,6 @@ Kimono1_1920x1080_24_400.yuv,--preset medium --rdoq-level 0 --limit-refs 3 --slices 2 Kimono1_1920x1080_24_400.yuv,--preset veryslow --crf 4 --cu-lossless --slices 2 --limit-refs 3 --limit-modes Kimono1_1920x1080_24_400.yuv,--preset placebo --ctu 32 --max-tu-size 8 --limit-tu 2 -big_buck_bunny_360p24.y4m, --keyint 60 --min-keyint 40 --gop-lookahead 14 -BasketballDrive_1920x1080_50.y4m, --preset medium --no-open-gop --keyint 50 --min-keyint 50 --radl 2 # Main12 intraCost overflow bug test 720p50_parkrun_ter.y4m,--preset medium
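Each entry in the hunks above has the shape `input,options`, and the analysis save/load entries chain several option sets with `::`. A minimal sketch of splitting such a line, assuming exactly that layout (the first comma ends the input name and `::` separates passes); `TestEntry` and `parseTestLine` are hypothetical names, not part of the x265 test harness:

```cpp
#include <string>
#include <vector>

// Hypothetical representation of one regression-test entry.
struct TestEntry
{
    std::string input;                // text before the first comma
    std::vector<std::string> passes;  // one option string per "::"-separated pass
};

// Split "input,opts" or "input,opts1::opts2::..." into its parts.
static TestEntry parseTestLine(const std::string& line)
{
    TestEntry e;
    const std::string::size_type comma = line.find(',');
    e.input = line.substr(0, comma);
    if (comma == std::string::npos)
        return e;                     // no options on this line

    std::string::size_type start = comma + 1;
    for (;;)
    {
        const std::string::size_type sep = line.find("::", start);
        if (sep == std::string::npos)
        {
            e.passes.push_back(line.substr(start));
            break;
        }
        e.passes.push_back(line.substr(start, sep - start));
        start = sep + 2;              // skip the "::" separator
    }
    return e;
}
```

With this split, the 2.7 `old_town_cross_444_720p50.y4m` entry yields three passes, while its 2.6 replacement yields two.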
x265_2.7.tar.gz/source/x265.cpp -> x265_2.6.tar.gz/source/x265.cpp
Changed
@@ -301,15 +301,9 @@ if (!this->qpfile) x265_log_file(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file\n", optarg); } - OPT("fullhelp") - { - param->logLevel = X265_LOG_FULL; - printVersion(param, api); - showHelp(param); - break; - } else bError |= !!api->param_parse(param, long_options[long_options_index].name, optarg); + if (bError) { const char *name = long_options_index > 0 ? long_options[long_options_index].name : argv[optind - 2]; @@ -585,9 +579,9 @@ x265_picture pic_orig, pic_out; x265_picture *pic_in = &pic_orig; - /* Allocate recon picture if analysis save/load is enabled */ + /* Allocate recon picture if analysisReuseMode is enabled */ std::priority_queue<int64_t>* pts_queue = cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL; - x265_picture *pic_recon = (cliopt.recon || param->analysisSave || param->analysisLoad || pts_queue || reconPlay || param->csvLogLevel) ? &pic_out : NULL; + x265_picture *pic_recon = (cliopt.recon || !!param->analysisReuseMode || pts_queue || reconPlay || param->csvLogLevel) ? &pic_out : NULL; uint32_t inFrameCount = 0; uint32_t outFrameCount = 0; x265_nal *p_nal;
x265_2.7.tar.gz/source/x265.h -> x265_2.6.tar.gz/source/x265.h
Changed
@@ -327,15 +327,15 @@ * to allow the encoder to determine base QP */ int forceqp; - /* If param.analysisLoad and param.analysisSave are disabled, this field is - * ignored on input and output. Else the user must call x265_alloc_analysis_data() - * to allocate analysis buffers for every picture passed to the encoder. + /* If param.analysisReuseMode is X265_ANALYSIS_OFF this field is ignored on input + * and output. Else the user must call x265_alloc_analysis_data() to + * allocate analysis buffers for every picture passed to the encoder. * - * On input when param.analysisLoad is enabled and analysisData + * On input when param.analysisReuseMode is X265_ANALYSIS_LOAD and analysisData * member pointers are valid, the encoder will use the data stored here to * reduce encoder work. * - * On output when param.analysisSave is enabled and analysisData + * On output when param.analysisReuseMode is X265_ANALYSIS_SAVE and analysisData * member pointers are valid, the encoder will write output analysis into * this data structure */ x265_analysis_data analysisData; @@ -481,7 +481,9 @@ #define X265_CSP_BGRA 7 /* packed bgr 32bits */ #define X265_CSP_RGB 8 /* packed rgb 24bits */ #define X265_CSP_MAX 9 /* end of list */ + #define X265_EXTENDED_SAR 255 /* aspect ratio explicitly specified as width:height */ + /* Analysis options */ #define X265_ANALYSIS_OFF 0 #define X265_ANALYSIS_SAVE 1 @@ -1127,13 +1129,13 @@ * Default disabled */ int bEnableRdRefine; - /* If save, write per-frame analysis information into analysis buffers. - * If load, read analysis information into analysis buffer and use this - * analysis information to reduce the amount of work the encoder must perform. - * Default disabled. Now deprecated*/ + /* If X265_ANALYSIS_SAVE, write per-frame analysis information into analysis + * buffers. if X265_ANALYSIS_LOAD, read analysis information into analysis + * buffer and use this analysis information to reduce the amount of work + * the encoder must perform. 
Default X265_ANALYSIS_OFF */ int analysisReuseMode; - /* Filename for multi-pass-opt-analysis/distortion. Default name is "x265_analysis.dat" */ + /* Filename for analysisReuseMode save/load. Default name is "x265_analysis.dat" */ const char* analysisReuseFileName; /*== Rate Control ==*/ @@ -1271,7 +1273,6 @@ /* internally enable if tune grain is set */ int bEnableConstVbv; - } rc; /*== Video Usability Information ==*/ @@ -1454,7 +1455,7 @@ int bHDROpt; /* A value between 1 and 10 (both inclusive) determines the level of - * information stored/reused in analysis save/load. Higher the refine + * information stored/reused in save/load analysis-reuse-mode. Higher the refine * level higher the information stored/reused. Default is 5 */ int analysisReuseLevel; @@ -1531,23 +1532,9 @@ /* Reuse MV information obtained through API */ int bMVType; + /* Allow the encoder to have a copy of the planes of x265_picture in Frame */ int bCopyPicToFrame; - - /*Number of frames for GOP boundary decision lookahead.If a scenecut frame is found - * within this from the gop boundary set by keyint, the GOP will be extented until such a point, - * otherwise the GOP will be terminated as set by keyint*/ - int gopLookahead; - - /*Write per-frame analysis information into analysis buffers. Default disabled. */ - const char* analysisSave; - - /* Read analysis information into analysis buffer and use this analysis information - * to reduce the amount of work the encoder must perform. Default disabled. */ - const char* analysisLoad; - - /*Number of RADL pictures allowed in front of IDR*/ - int radl; } x265_param; /* x265_param_alloc: @@ -1756,7 +1743,7 @@ /* x265_get_ref_frame_list: * returns negative on error, 0 when access unit were output. 
* This API must be called after(poc >= lookaheadDepth + bframes + 2) condition check */ -int x265_get_ref_frame_list(x265_encoder *encoder, x265_picyuv**, x265_picyuv**, int, int, int*, int*); +int x265_get_ref_frame_list(x265_encoder *encoder, x265_picyuv**, x265_picyuv**, int, int); /* x265_set_analysis_data: * set the analysis data. The incoming analysis_data structure is assumed to be AVC-sized blocks. @@ -1779,10 +1766,9 @@ void x265_csvlog_frame(const x265_param *, const x265_picture *); /* Log final encode statistics to the CSV file handle. 'argc' and 'argv' are - * intended to be command line arguments passed to the encoder. padx and pady are - * padding offsets for conformance and can be given from sps settings. Encode + * intended to be command line arguments passed to the encoder. Encode * statistics should be queried from the encoder just prior to closing it. */ -void x265_csvlog_encode(const x265_param*, const x265_stats *, int padx, int pady, int argc, char** argv); +void x265_csvlog_encode(x265_encoder *encoder, const x265_stats *, int argc, char** argv); /* In-place downshift from a bit-depth greater than 8 to a bit-depth of 8, using * the residual bits to dither each row. 
*/ @@ -1834,10 +1820,10 @@ int (*encoder_intra_refresh)(x265_encoder*); int (*encoder_ctu_info)(x265_encoder*, int, x265_ctu_info_t**); int (*get_slicetype_poc_and_scenecut)(x265_encoder*, int*, int*, int*); - int (*get_ref_frame_list)(x265_encoder*, x265_picyuv**, x265_picyuv**, int, int, int*, int*); + int (*get_ref_frame_list)(x265_encoder*, x265_picyuv**, x265_picyuv**, int, int); FILE* (*csvlog_open)(const x265_param*); void (*csvlog_frame)(const x265_param*, const x265_picture*); - void (*csvlog_encode)(const x265_param*, const x265_stats *, int, int, int, char**); + void (*csvlog_encode)(x265_encoder*, const x265_stats*, int, char**); void (*dither_image)(x265_picture*, int, int, int16_t*, int); int (*set_analysis_data)(x265_encoder *encoder, x265_analysis_data *analysis_data, int poc, uint32_t cuBytes); /* add new pointers to the end, or increment X265_MAJOR_VERSION */
x265_2.7.tar.gz/source/x265cli.h -> x265_2.6.tar.gz/source/x265cli.h
Changed
@@ -38,7 +38,6 @@ static const struct option long_options[] = { { "help", no_argument, NULL, 'h' }, - { "fullhelp", no_argument, NULL, 0 }, { "version", no_argument, NULL, 'V' }, { "asm", required_argument, NULL, 0 }, { "no-asm", no_argument, NULL, 0 }, @@ -120,11 +119,9 @@ { "open-gop", no_argument, NULL, 0 }, { "keyint", required_argument, NULL, 'I' }, { "min-keyint", required_argument, NULL, 'i' }, - { "gop-lookahead", required_argument, NULL, 0 }, { "scenecut", required_argument, NULL, 0 }, { "no-scenecut", no_argument, NULL, 0 }, { "scenecut-bias", required_argument, NULL, 0 }, - { "radl", required_argument, NULL, 0 }, { "ctu-info", required_argument, NULL, 0 }, { "intra-refresh", no_argument, NULL, 0 }, { "rc-lookahead", required_argument, NULL, 0 }, @@ -255,11 +252,9 @@ { "no-slow-firstpass", no_argument, NULL, 0 }, { "multi-pass-opt-rps", no_argument, NULL, 0 }, { "no-multi-pass-opt-rps", no_argument, NULL, 0 }, - { "analysis-reuse-mode", required_argument, NULL, 0 }, /* DEPRECATED */ - { "analysis-reuse-file", required_argument, NULL, 0 }, + { "analysis-reuse-mode", required_argument, NULL, 0 }, + { "analysis-reuse-file", required_argument, NULL, 0 }, { "analysis-reuse-level", required_argument, NULL, 0 }, - { "analysis-save", required_argument, NULL, 0 }, - { "analysis-load", required_argument, NULL, 0 }, { "scale-factor", required_argument, NULL, 0 }, { "refine-intra", required_argument, NULL, 0 }, { "refine-inter", required_argument, NULL, 0 }, @@ -319,7 +314,6 @@ H0(" outfile is raw HEVC bitstream\n"); H0("\nExecutable Options:\n"); H0("-h/--help Show this help text and exit\n"); - H0(" --fullhelp Show all options and exit\n"); H0("-V/--version Show version info and exit\n"); H0("\nOutput Options:\n"); H0("-o/--output <filename> Bitstream output file name\n"); @@ -424,11 +418,9 @@ H0(" --[no-]open-gop Enable open-GOP, allows I slices to be non-IDR. Default %s\n", OPT(param->bOpenGOP)); H0("-I/--keyint <integer> Max IDR period in frames. 
-1 for infinite-gop. Default %d\n", param->keyframeMax); H0("-i/--min-keyint <integer> Scenecuts closer together than this are coded as I, not IDR. Default: auto\n"); - H0(" --gop-lookahead <integer> Extends gop boundary if a scenecut is found within this from keyint boundary. Default 0\n"); H0(" --no-scenecut Disable adaptive I-frame decision\n"); H0(" --scenecut <integer> How aggressively to insert extra I-frames. Default %d\n", param->scenecutThreshold); H1(" --scenecut-bias <0..100.0> Bias for scenecut detection. Default %.2f\n", param->scenecutBias); - H0(" --radl <integer> Number of RADL pictures allowed in front of IDR. Default %d\n", param->radl); H0(" --intra-refresh Use Periodic Intra Refresh instead of IDR frames\n"); H0(" --rc-lookahead <integer> Number of frames for frame-type lookahead (determines encoder latency) Default %d\n", param->lookaheadDepth); H1(" --lookahead-slices <0..16> Number of slices to use per lookahead cost estimate. Default %d\n", param->lookaheadSlices); @@ -469,19 +461,18 @@ H0(" --[no-]analyze-src-pics Motion estimation uses source frame planes. Default disable\n"); H0(" --[no-]slow-firstpass Enable a slow first pass in a multipass rate control mode. Default %s\n", OPT(param->rc.bEnableSlowFirstPass)); H0(" --[no-]strict-cbr Enable stricter conditions and tolerance for bitrate deviations in CBR mode. Default %s\n", OPT(param->rc.bStrictCbr)); - H0(" --analysis-save <filename> Dump analysis info into the specified file. Default Disabled\n"); - H0(" --analysis-load <filename> Load analysis buffers from the file specified. Default Disabled\n"); + H0(" --analysis-reuse-mode <string|int> save - Dump analysis info into file, load - Load analysis buffers from the file. Default %d\n", param->analysisReuseMode); H0(" --analysis-reuse-file <filename> Specify file name used for either dumping or reading analysis data. 
Deault x265_analysis.dat\n"); H0(" --analysis-reuse-level <1..10> Level of analysis reuse indicates amount of info stored/reused in save/load mode, 1:least..10:most. Default %d\n", param->analysisReuseLevel); H0(" --refine-mv-type <string> Reuse MV information received through API call. Supported option is avc. Default disabled - %d\n", param->bMVType); H0(" --scale-factor <int> Specify factor by which input video is scaled down for analysis save mode. Default %d\n", param->scaleFactor); - H0(" --refine-intra <0..3> Enable intra refinement for encode that uses analysis-load.\n" + H0(" --refine-intra <0..3> Enable intra refinement for encode that uses analysis-reuse-mode=load.\n" " - 0 : Forces both mode and depth from the save encode.\n" " - 1 : Functionality of (0) + evaluate all intra modes at min-cu-size's depth when current depth is one smaller than min-cu-size's depth.\n" " - 2 : Functionality of (1) + irrespective of size evaluate all angular modes when the save encode decides the best mode as angular.\n" " - 3 : Functionality of (1) + irrespective of size evaluate all intra modes.\n" " Default:%d\n", param->intraRefine); - H0(" --refine-inter <0..3> Enable inter refinement for encode that uses analysis-load.\n" + H0(" --refine-inter <0..3> Enable inter refinement for encode that uses analysis-reuse-mode=load.\n" " - 0 : Forces both mode and depth from the save encode.\n" " - 1 : Functionality of (0) + evaluate all inter modes at min-cu-size's depth when current depth is one smaller than\n" " min-cu-size's depth. When save encode decides the current block as skip(for all sizes) evaluate skip/merge.\n" @@ -572,8 +563,9 @@ #undef OPT #undef H0 #undef H1 + if (level < X265_LOG_DEBUG) - printf("\nUse --fullhelp for a full listing (or --log-level full --help)\n"); + printf("\nUse --log-level full --help for a full listing\n"); printf("\n\nComplete documentation may be found at http://x265.readthedocs.org/en/default/cli.html\n"); exit(1); }
x265_2.7.tar.gz/source/cmake/CMakeASM_NASMInformation.cmake
Deleted
@@ -1,68 +0,0 @@
-set(ASM_DIALECT "_NASM")
-set(CMAKE_ASM${ASM_DIALECT}_SOURCE_FILE_EXTENSIONS asm)
-
-if(X64)
-    list(APPEND ASM_FLAGS -DARCH_X86_64=1 -I ${CMAKE_CURRENT_SOURCE_DIR}/../common/x86/)
-    if(ENABLE_PIC)
-        list(APPEND ASM_FLAGS -DPIC)
-    endif()
-    if(APPLE)
-        set(ARGS -f macho64 -DPREFIX)
-    elseif(UNIX AND NOT CYGWIN)
-        set(ARGS -f elf64)
-    else()
-        set(ARGS -f win64)
-    endif()
-else()
-    list(APPEND ASM_FLAGS -DARCH_X86_64=0 -I ${CMAKE_CURRENT_SOURCE_DIR}/../common/x86/)
-    if(APPLE)
-        set(ARGS -f macho32 -DPREFIX)
-    elseif(UNIX AND NOT CYGWIN)
-        set(ARGS -f elf32)
-    else()
-        set(ARGS -f win32 -DPREFIX)
-    endif()
-endif()
-
-if(GCC)
-    list(APPEND ASM_FLAGS -DHAVE_ALIGNED_STACK=1)
-else()
-    list(APPEND ASM_FLAGS -DHAVE_ALIGNED_STACK=0)
-endif()
-
-if(HIGH_BIT_DEPTH)
-    if(MAIN12)
-        list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=12 -DX265_NS=${X265_NS})
-    else()
-        list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=10 -DX265_NS=${X265_NS})
-    endif()
-else()
-    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=0 -DBIT_DEPTH=8 -DX265_NS=${X265_NS})
-endif()
-
-list(APPEND ASM_FLAGS "${CMAKE_ASM_NASM_FLAGS}")
-
-if(CMAKE_BUILD_TYPE MATCHES Release)
-    list(APPEND ASM_FLAGS "${CMAKE_ASM_NASM_FLAGS_RELEASE}")
-elseif(CMAKE_BUILD_TYPE MATCHES Debug)
-    list(APPEND ASM_FLAGS "${CMAKE_ASM_NASM_FLAGS_DEBUG}")
-elseif(CMAKE_BUILD_TYPE MATCHES MinSizeRel)
-    list(APPEND ASM_FLAGS "${CMAKE_ASM_NASM_FLAGS_MINSIZEREL}")
-elseif(CMAKE_BUILD_TYPE MATCHES RelWithDebInfo)
-    list(APPEND ASM_FLAGS "${CMAKE_ASM_NASM_FLAGS_RELWITHDEBINFO}")
-endif()
-
-set(NASM_FLAGS ${ARGS} ${ASM_FLAGS} PARENT_SCOPE)
-string(REPLACE ";" " " CMAKE_ASM_NASM_COMPILER_ARG1 "${ARGS}")
-
-# This section exists to override the one in CMakeASMInformation.cmake
-# (the default Information file). This removes the <FLAGS>
-# thing so that your C compiler flags that have been set via
-# set_target_properties don't get passed to nasm and confuse it.
-if(NOT CMAKE_ASM${ASM_DIALECT}_COMPILE_OBJECT)
-    string(REPLACE ";" " " STR_ASM_FLAGS "${ASM_FLAGS}")
-    set(CMAKE_ASM${ASM_DIALECT}_COMPILE_OBJECT "<CMAKE_ASM${ASM_DIALECT}_COMPILER> ${STR_ASM_FLAGS} -o <OBJECT> <SOURCE>")
-endif()
-
-include(CMakeASMInformation)
-set(ASM_DIALECT)
x265_2.7.tar.gz/source/cmake/CMakeDetermineASM_NASMCompiler.cmake
Deleted
@@ -1,5 +0,0 @@
-set(ASM_DIALECT "_NASM")
-set(CMAKE_ASM${ASM_DIALECT}_COMPILER ${NASM_EXECUTABLE})
-set(CMAKE_ASM${ASM_DIALECT}_COMPILER_INIT ${_CMAKE_TOOLCHAIN_PREFIX}nasm)
-include(CMakeDetermineASMCompiler)
-set(ASM_DIALECT)
x265_2.7.tar.gz/source/cmake/CMakeTestASM_NASMCompiler.cmake
Deleted
@@ -1,3 +0,0 @@
-set(ASM_DIALECT "_NASM")
-include(CMakeTestASMCompiler)
-set(ASM_DIALECT)
x265_2.7.tar.gz/source/cmake/FindNasm.cmake
Deleted
@@ -1,25 +0,0 @@
-include(FindPackageHandleStandardArgs)
-
-# Simple path search with YASM_ROOT environment variable override
-find_program(NASM_EXECUTABLE
-    NAMES nasm nasm-2.13.0-win32 nasm-2.13.0-win64 nasm nasm-2.13.0-win32 nasm-2.13.0-win64
-    HINTS $ENV{NASM_ROOT} ${NASM_ROOT}
-    PATH_SUFFIXES bin
-)
-
-if(NASM_EXECUTABLE)
-    execute_process(COMMAND ${NASM_EXECUTABLE} -version
-        OUTPUT_VARIABLE nasm_version
-        ERROR_QUIET
-        OUTPUT_STRIP_TRAILING_WHITESPACE
-        )
-    if(nasm_version MATCHES "^NASM version ([0-9\\.]*)")
-        set(NASM_VERSION_STRING "${CMAKE_MATCH_1}")
-    endif()
-    unset(nasm_version)
-endif()
-
-# Provide standardized success/failure messages
-find_package_handle_standard_args(nasm
-    REQUIRED_VARS NASM_EXECUTABLE
-    VERSION_VAR NASM_VERSION_STRING)
x265_2.7.tar.gz/source/common/x86/h-ipfilter16.asm
Deleted
@@ -1,2537 +0,0 @@ -;***************************************************************************** -;* Copyright (C) 2013-2017 MulticoreWare, Inc -;* -;* Authors: Nabajit Deka <nabajit@multicorewareinc.com> -;* Murugan Vairavel <murugan@multicorewareinc.com> -;* Min Chen <chenm003@163.com> -;* -;* This program is free software; you can redistribute it and/or modify -;* it under the terms of the GNU General Public License as published by -;* the Free Software Foundation; either version 2 of the License, or -;* (at your option) any later version. -;* -;* This program is distributed in the hope that it will be useful, -;* but WITHOUT ANY WARRANTY; without even the implied warranty of -;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -;* GNU General Public License for more details. -;* -;* You should have received a copy of the GNU General Public License -;* along with this program; if not, write to the Free Software -;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. -;* -;* This program is also available under a commercial proprietary license. -;* For more information, contact us at license @ x265.com. -;*****************************************************************************/ -%include "x86inc.asm" -%include "x86util.asm" - - -%define INTERP_OFFSET_PP pd_32 -%define INTERP_SHIFT_PP 6 - -%if BIT_DEPTH == 10 - %define INTERP_SHIFT_PS 2 - %define INTERP_OFFSET_PS pd_n32768 - %define INTERP_SHIFT_SP 10 - %define INTERP_OFFSET_SP h_pd_524800 -%elif BIT_DEPTH == 12 - %define INTERP_SHIFT_PS 4 - %define INTERP_OFFSET_PS pd_n131072 - %define INTERP_SHIFT_SP 8 - %define INTERP_OFFSET_SP pd_524416 -%else - %error Unsupport bit depth! 
-%endif - -SECTION_RODATA 32 - -h_pd_524800: times 8 dd 524800 - -tab_LumaCoeff: dw 0, 0, 0, 64, 0, 0, 0, 0 - dw -1, 4, -10, 58, 17, -5, 1, 0 - dw -1, 4, -11, 40, 40, -11, 4, -1 - dw 0, 1, -5, 17, 58, -10, 4, -1 - -ALIGN 32 -h_tab_LumaCoeffV: times 4 dw 0, 0 - times 4 dw 0, 64 - times 4 dw 0, 0 - times 4 dw 0, 0 - - times 4 dw -1, 4 - times 4 dw -10, 58 - times 4 dw 17, -5 - times 4 dw 1, 0 - - times 4 dw -1, 4 - times 4 dw -11, 40 - times 4 dw 40, -11 - times 4 dw 4, -1 - - times 4 dw 0, 1 - times 4 dw -5, 17 - times 4 dw 58, -10 - times 4 dw 4, -1 - -const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 - -const interp8_hpp_shuf, db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 - db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 - -const interp8_hpp_shuf_new, db 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9 - db 4, 5, 6, 7, 6, 7, 8, 9, 8, 9, 10, 11, 10, 11, 12, 13 - -SECTION .text -cextern pd_8 -cextern pd_32 -cextern pw_pixel_max -cextern pd_524416 -cextern pd_n32768 -cextern pd_n131072 -cextern pw_2000 -cextern idct8_shuf2 - -%macro FILTER_LUMA_HOR_4_sse2 1 - movu m4, [r0 + %1] ; m4 = src[0-7] - movu m5, [r0 + %1 + 2] ; m5 = src[1-8] - pmaddwd m4, m0 - pmaddwd m5, m0 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m4, m4, q3120 - pshufd m5, m5, q3120 - punpcklqdq m4, m5 - - movu m5, [r0 + %1 + 4] ; m5 = src[2-9] - movu m3, [r0 + %1 + 6] ; m3 = src[3-10] - pmaddwd m5, m0 - pmaddwd m3, m0 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m2, m3, q2301 - paddd m3, m2 - pshufd m5, m5, q3120 - pshufd m3, m3, q3120 - punpcklqdq m5, m3 - - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m4, m4, q3120 - pshufd m5, m5, q3120 - punpcklqdq m4, m5 - paddd m4, m1 -%endmacro - -%macro FILTER_LUMA_HOR_8_sse2 1 - movu m4, [r0 + %1] ; m4 = src[0-7] - movu m5, [r0 + %1 + 2] ; m5 = src[1-8] - pmaddwd m4, m0 - pmaddwd m5, m0 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd 
m4, m4, q3120 - pshufd m5, m5, q3120 - punpcklqdq m4, m5 - - movu m5, [r0 + %1 + 4] ; m5 = src[2-9] - movu m3, [r0 + %1 + 6] ; m3 = src[3-10] - pmaddwd m5, m0 - pmaddwd m3, m0 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m2, m3, q2301 - paddd m3, m2 - pshufd m5, m5, q3120 - pshufd m3, m3, q3120 - punpcklqdq m5, m3 - - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m4, m4, q3120 - pshufd m5, m5, q3120 - punpcklqdq m4, m5 - paddd m4, m1 - - movu m5, [r0 + %1 + 8] ; m5 = src[4-11] - movu m6, [r0 + %1 + 10] ; m6 = src[5-12] - pmaddwd m5, m0 - pmaddwd m6, m0 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m2, m6, q2301 - paddd m6, m2 - pshufd m5, m5, q3120 - pshufd m6, m6, q3120 - punpcklqdq m5, m6 - - movu m6, [r0 + %1 + 12] ; m6 = src[6-13] - movu m3, [r0 + %1 + 14] ; m3 = src[7-14] - pmaddwd m6, m0 - pmaddwd m3, m0 - pshufd m2, m6, q2301 - paddd m6, m2 - pshufd m2, m3, q2301 - paddd m3, m2 - pshufd m6, m6, q3120 - pshufd m3, m3, q3120 - punpcklqdq m6, m3 - - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m2, m6, q2301 - paddd m6, m2 - pshufd m5, m5, q3120 - pshufd m6, m6, q3120 - punpcklqdq m5, m6 - paddd m5, m1 -%endmacro - -;------------------------------------------------------------------------------------------------------------ -; void interp_8tap_horiz_p%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------ -%macro FILTER_HOR_LUMA_sse2 3 -INIT_XMM sse2 -cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 - mov r4d, r4m - sub r0, 6 - shl r4d, 4 - add r1d, r1d - add r3d, r3d - -%ifdef PIC - lea r6, [tab_LumaCoeff] - mova m0, [r6 + r4] -%else - mova m0, [tab_LumaCoeff + r4] -%endif - -%ifidn %3, pp - mova m1, [pd_32] - pxor m7, m7 -%else - mova m1, [INTERP_OFFSET_PS] -%endif - - mov r4d, %2 -%ifidn %3, ps - cmp r5m, byte 0 - je .loopH - lea r6, [r1 + 2 * r1] - sub r0, r6 - add r4d, 7 -%endif - 
-.loopH: -%assign x 0 -%rep %1/8 - FILTER_LUMA_HOR_8_sse2 x - -%ifidn %3, pp - psrad m4, 6 - psrad m5, 6 - packssdw m4, m5 - CLIPW m4, m7, [pw_pixel_max] -%else - %if BIT_DEPTH == 10 - psrad m4, 2 - psrad m5, 2 - %elif BIT_DEPTH == 12 - psrad m4, 4 - psrad m5, 4 - %endif - packssdw m4, m5 -%endif - - movu [r2 + x], m4 -%assign x x+16 -%endrep - -%rep (%1 % 8)/4 - FILTER_LUMA_HOR_4_sse2 x - -%ifidn %3, pp - psrad m4, 6 - packssdw m4, m4 - CLIPW m4, m7, [pw_pixel_max] -%else - %if BIT_DEPTH == 10 - psrad m4, 2 - %elif BIT_DEPTH == 12 - psrad m4, 4 - %endif - packssdw m4, m4 -%endif - - movh [r2 + x], m4 -%endrep - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET - -%endmacro - -;------------------------------------------------------------------------------------------------------------ -; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------ - FILTER_HOR_LUMA_sse2 4, 4, pp - FILTER_HOR_LUMA_sse2 4, 8, pp - FILTER_HOR_LUMA_sse2 4, 16, pp - FILTER_HOR_LUMA_sse2 8, 4, pp - FILTER_HOR_LUMA_sse2 8, 8, pp - FILTER_HOR_LUMA_sse2 8, 16, pp - FILTER_HOR_LUMA_sse2 8, 32, pp - FILTER_HOR_LUMA_sse2 12, 16, pp - FILTER_HOR_LUMA_sse2 16, 4, pp - FILTER_HOR_LUMA_sse2 16, 8, pp - FILTER_HOR_LUMA_sse2 16, 12, pp - FILTER_HOR_LUMA_sse2 16, 16, pp - FILTER_HOR_LUMA_sse2 16, 32, pp - FILTER_HOR_LUMA_sse2 16, 64, pp - FILTER_HOR_LUMA_sse2 24, 32, pp - FILTER_HOR_LUMA_sse2 32, 8, pp - FILTER_HOR_LUMA_sse2 32, 16, pp - FILTER_HOR_LUMA_sse2 32, 24, pp - FILTER_HOR_LUMA_sse2 32, 32, pp - FILTER_HOR_LUMA_sse2 32, 64, pp - FILTER_HOR_LUMA_sse2 48, 64, pp - FILTER_HOR_LUMA_sse2 64, 16, pp - FILTER_HOR_LUMA_sse2 64, 32, pp - FILTER_HOR_LUMA_sse2 64, 48, pp - FILTER_HOR_LUMA_sse2 64, 64, pp - -;--------------------------------------------------------------------------------------------------------------------------- -; void 
interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;--------------------------------------------------------------------------------------------------------------------------- - FILTER_HOR_LUMA_sse2 4, 4, ps - FILTER_HOR_LUMA_sse2 4, 8, ps - FILTER_HOR_LUMA_sse2 4, 16, ps - FILTER_HOR_LUMA_sse2 8, 4, ps - FILTER_HOR_LUMA_sse2 8, 8, ps - FILTER_HOR_LUMA_sse2 8, 16, ps - FILTER_HOR_LUMA_sse2 8, 32, ps - FILTER_HOR_LUMA_sse2 12, 16, ps - FILTER_HOR_LUMA_sse2 16, 4, ps - FILTER_HOR_LUMA_sse2 16, 8, ps - FILTER_HOR_LUMA_sse2 16, 12, ps - FILTER_HOR_LUMA_sse2 16, 16, ps - FILTER_HOR_LUMA_sse2 16, 32, ps - FILTER_HOR_LUMA_sse2 16, 64, ps - FILTER_HOR_LUMA_sse2 24, 32, ps - FILTER_HOR_LUMA_sse2 32, 8, ps - FILTER_HOR_LUMA_sse2 32, 16, ps - FILTER_HOR_LUMA_sse2 32, 24, ps - FILTER_HOR_LUMA_sse2 32, 32, ps - FILTER_HOR_LUMA_sse2 32, 64, ps - FILTER_HOR_LUMA_sse2 48, 64, ps - FILTER_HOR_LUMA_sse2 64, 16, ps - FILTER_HOR_LUMA_sse2 64, 32, ps - FILTER_HOR_LUMA_sse2 64, 48, ps - FILTER_HOR_LUMA_sse2 64, 64, ps - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_HOR_CHROMA_sse3 3 -INIT_XMM sse3 -cglobal interp_4tap_horiz_%3_%1x%2, 4, 7, 8 - add r3, r3 - add r1, r1 - sub r0, 2 - mov r4d, r4m - add r4d, r4d - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - movddup m0, [r6 + r4 * 4] -%else - movddup m0, [tab_ChromaCoeff + r4 * 4] -%endif - -%ifidn %3, ps - mova m1, [INTERP_OFFSET_PS] - cmp r5m, byte 0 -%if %1 <= 6 - lea r4, [r1 * 3] - lea r5, [r3 * 3] -%endif - je .skip - sub r0, r1 -%if %1 <= 6 -%assign y 1 -%else -%assign y 3 -%endif -%assign z 0 -%rep y -%assign x 0 -%rep %1/8 - FILTERH_W8_1_sse3 x, %3 -%assign x x+16 -%endrep -%if %1 == 4 || (%1 == 6 && z == 0) || (%1 == 12 && z == 
0) - FILTERH_W4_2_sse3 x, %3 - FILTERH_W4_1_sse3 x -%assign x x+8 -%endif -%if %1 == 2 || (%1 == 6 && z == 0) - FILTERH_W2_3_sse3 x -%endif -%if %1 <= 6 - lea r0, [r0 + r4] - lea r2, [r2 + r5] -%else - lea r0, [r0 + r1] - lea r2, [r2 + r3] -%endif -%assign z z+1 -%endrep -.skip: -%elifidn %3, pp - pxor m7, m7 - mova m6, [pw_pixel_max] - mova m1, [tab_c_32] -%if %1 == 2 || %1 == 6 - lea r4, [r1 * 3] - lea r5, [r3 * 3] -%endif -%endif - -%if %1 == 2 -%assign y %2/4 -%elif %1 <= 6 -%assign y %2/2 -%else -%assign y %2 -%endif -%assign z 0 -%rep y -%assign x 0 -%rep %1/8 - FILTERH_W8_1_sse3 x, %3 -%assign x x+16 -%endrep -%if %1 == 4 || %1 == 6 || (%1 == 12 && (z % 2) == 0) - FILTERH_W4_2_sse3 x, %3 -%assign x x+8 -%endif -%if %1 == 2 || (%1 == 6 && (z % 2) == 0) - FILTERH_W2_4_sse3 x, %3 -%endif -%assign z z+1 -%if z < y -%if %1 == 2 - lea r0, [r0 + 4 * r1] - lea r2, [r2 + 4 * r3] -%elif %1 <= 6 - lea r0, [r0 + 2 * r1] - lea r2, [r2 + 2 * r3] -%else - lea r0, [r0 + r1] - lea r2, [r2 + r3] -%endif -%endif ;z < y -%endrep - - RET -%endmacro - -%macro FILTER_P2S_2_4_sse2 1 - movd m0, [r0 + %1] - movd m2, [r0 + r1 * 2 + %1] - movhps m0, [r0 + r1 + %1] - movhps m2, [r0 + r4 + %1] - psllw m0, (14 - BIT_DEPTH) - psllw m2, (14 - BIT_DEPTH) - psubw m0, m1 - psubw m2, m1 - - movd [r2 + r3 * 0 + %1], m0 - movd [r2 + r3 * 2 + %1], m2 - movhlps m0, m0 - movhlps m2, m2 - movd [r2 + r3 * 1 + %1], m0 - movd [r2 + r5 + %1], m2 -%endmacro - -%macro FILTER_P2S_4_4_sse2 1 - movh m0, [r0 + %1] - movhps m0, [r0 + r1 + %1] - psllw m0, (14 - BIT_DEPTH) - psubw m0, m1 - movh [r2 + r3 * 0 + %1], m0 - movhps [r2 + r3 * 1 + %1], m0 - - movh m2, [r0 + r1 * 2 + %1] - movhps m2, [r0 + r4 + %1] - psllw m2, (14 - BIT_DEPTH) - psubw m2, m1 - movh [r2 + r3 * 2 + %1], m2 - movhps [r2 + r5 + %1], m2 -%endmacro - -%macro FILTER_P2S_4_2_sse2 0 - movh m0, [r0] - movhps m0, [r0 + r1 * 2] - psllw m0, (14 - BIT_DEPTH) - psubw m0, [pw_2000] - movh [r2 + r3 * 0], m0 - movhps [r2 + r3 * 2], m0 -%endmacro - -%macro 
FILTER_P2S_8_4_sse2 1 - movu m0, [r0 + %1] - movu m2, [r0 + r1 + %1] - psllw m0, (14 - BIT_DEPTH) - psllw m2, (14 - BIT_DEPTH) - psubw m0, m1 - psubw m2, m1 - movu [r2 + r3 * 0 + %1], m0 - movu [r2 + r3 * 1 + %1], m2 - - movu m3, [r0 + r1 * 2 + %1] - movu m4, [r0 + r4 + %1] - psllw m3, (14 - BIT_DEPTH) - psllw m4, (14 - BIT_DEPTH) - psubw m3, m1 - psubw m4, m1 - movu [r2 + r3 * 2 + %1], m3 - movu [r2 + r5 + %1], m4 -%endmacro - -%macro FILTER_P2S_8_2_sse2 1 - movu m0, [r0 + %1] - movu m2, [r0 + r1 + %1] - psllw m0, (14 - BIT_DEPTH) - psllw m2, (14 - BIT_DEPTH) - psubw m0, m1 - psubw m2, m1 - movu [r2 + r3 * 0 + %1], m0 - movu [r2 + r3 * 1 + %1], m2 -%endmacro - -;----------------------------------------------------------------------------- -; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) -;----------------------------------------------------------------------------- -%macro FILTER_PIX_TO_SHORT_sse2 2 -INIT_XMM sse2 -cglobal filterPixelToShort_%1x%2, 4, 6, 3 -%if %2 == 2 -%if %1 == 4 - FILTER_P2S_4_2_sse2 -%elif %1 == 8 - add r1d, r1d - add r3d, r3d - mova m1, [pw_2000] - FILTER_P2S_8_2_sse2 0 -%endif -%else - add r1d, r1d - add r3d, r3d - mova m1, [pw_2000] - lea r4, [r1 * 3] - lea r5, [r3 * 3] -%assign y 1 -%rep %2/4 -%assign x 0 -%rep %1/8 - FILTER_P2S_8_4_sse2 x -%if %2 == 6 - lea r0, [r0 + 4 * r1] - lea r2, [r2 + 4 * r3] - FILTER_P2S_8_2_sse2 x -%endif -%assign x x+16 -%endrep -%rep (%1 % 8)/4 - FILTER_P2S_4_4_sse2 x -%assign x x+8 -%endrep -%rep (%1 % 4)/2 - FILTER_P2S_2_4_sse2 x -%endrep -%if y < %2/4 - lea r0, [r0 + 4 * r1] - lea r2, [r2 + 4 * r3] -%assign y y+1 -%endif -%endrep -%endif -RET -%endmacro - - FILTER_PIX_TO_SHORT_sse2 2, 4 - FILTER_PIX_TO_SHORT_sse2 2, 8 - FILTER_PIX_TO_SHORT_sse2 2, 16 - FILTER_PIX_TO_SHORT_sse2 4, 2 - FILTER_PIX_TO_SHORT_sse2 4, 4 - FILTER_PIX_TO_SHORT_sse2 4, 8 - FILTER_PIX_TO_SHORT_sse2 4, 16 - FILTER_PIX_TO_SHORT_sse2 4, 32 - FILTER_PIX_TO_SHORT_sse2 6, 8 - 
FILTER_PIX_TO_SHORT_sse2 6, 16 - FILTER_PIX_TO_SHORT_sse2 8, 2 - FILTER_PIX_TO_SHORT_sse2 8, 4 - FILTER_PIX_TO_SHORT_sse2 8, 6 - FILTER_PIX_TO_SHORT_sse2 8, 8 - FILTER_PIX_TO_SHORT_sse2 8, 12 - FILTER_PIX_TO_SHORT_sse2 8, 16 - FILTER_PIX_TO_SHORT_sse2 8, 32 - FILTER_PIX_TO_SHORT_sse2 8, 64 - FILTER_PIX_TO_SHORT_sse2 12, 16 - FILTER_PIX_TO_SHORT_sse2 12, 32 - FILTER_PIX_TO_SHORT_sse2 16, 4 - FILTER_PIX_TO_SHORT_sse2 16, 8 - FILTER_PIX_TO_SHORT_sse2 16, 12 - FILTER_PIX_TO_SHORT_sse2 16, 16 - FILTER_PIX_TO_SHORT_sse2 16, 24 - FILTER_PIX_TO_SHORT_sse2 16, 32 - FILTER_PIX_TO_SHORT_sse2 16, 64 - FILTER_PIX_TO_SHORT_sse2 24, 32 - FILTER_PIX_TO_SHORT_sse2 24, 64 - FILTER_PIX_TO_SHORT_sse2 32, 8 - FILTER_PIX_TO_SHORT_sse2 32, 16 - FILTER_PIX_TO_SHORT_sse2 32, 24 - FILTER_PIX_TO_SHORT_sse2 32, 32 - FILTER_PIX_TO_SHORT_sse2 32, 48 - FILTER_PIX_TO_SHORT_sse2 32, 64 - FILTER_PIX_TO_SHORT_sse2 48, 64 - FILTER_PIX_TO_SHORT_sse2 64, 16 - FILTER_PIX_TO_SHORT_sse2 64, 32 - FILTER_PIX_TO_SHORT_sse2 64, 48 - FILTER_PIX_TO_SHORT_sse2 64, 64 - -;------------------------------------------------------------------------------------------------------------ -; void interp_8tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------ -%macro FILTER_HOR_LUMA_W4 3 -INIT_XMM sse4 -cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 - mov r4d, r4m - sub r0, 6 - shl r4d, 4 - add r1, r1 - add r3, r3 - -%ifdef PIC - lea r6, [tab_LumaCoeff] - mova m0, [r6 + r4] -%else - mova m0, [tab_LumaCoeff + r4] -%endif - -%ifidn %3, pp - mova m1, [pd_32] - pxor m6, m6 - mova m7, [pw_pixel_max] -%else - mova m1, [INTERP_OFFSET_PS] -%endif - - mov r4d, %2 -%ifidn %3, ps - cmp r5m, byte 0 - je .loopH - lea r6, [r1 + 2 * r1] - sub r0, r6 - add r4d, 7 -%endif - -.loopH: - movu m2, [r0] ; m2 = src[0-7] - movu m3, [r0 + 16] ; m3 = src[8-15] - - pmaddwd m4, m2, m0 - palignr m5, m3, m2, 
2 ; m5 = src[1-8] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m3, m2, 4 ; m5 = src[2-9] - pmaddwd m5, m0 - palignr m3, m2, 6 ; m3 = src[3-10] - pmaddwd m3, m0 - phaddd m5, m3 - - phaddd m4, m5 - paddd m4, m1 -%ifidn %3, pp - psrad m4, 6 - packusdw m4, m4 - CLIPW m4, m6, m7 -%else - psrad m4, INTERP_SHIFT_PS - packssdw m4, m4 -%endif - - movh [r2], m4 - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET -%endmacro - -;------------------------------------------------------------------------------------------------------------ -; void interp_8tap_horiz_pp_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------ -FILTER_HOR_LUMA_W4 4, 4, pp -FILTER_HOR_LUMA_W4 4, 8, pp -FILTER_HOR_LUMA_W4 4, 16, pp - -;--------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;--------------------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W4 4, 4, ps -FILTER_HOR_LUMA_W4 4, 8, ps -FILTER_HOR_LUMA_W4 4, 16, ps - -;------------------------------------------------------------------------------------------------------------ -; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------ -%macro FILTER_HOR_LUMA_W8 3 -INIT_XMM sse4 -cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 - - add r1, r1 - add r3, r3 - mov r4d, r4m - sub r0, 6 - shl r4d, 4 - -%ifdef PIC - lea r6, [tab_LumaCoeff] - mova m0, [r6 + r4] -%else - mova m0, [tab_LumaCoeff + r4] -%endif - -%ifidn %3, pp - mova m1, [pd_32] - pxor m7, m7 -%else - mova m1, 
[INTERP_OFFSET_PS] -%endif - - mov r4d, %2 -%ifidn %3, ps - cmp r5m, byte 0 - je .loopH - lea r6, [r1 + 2 * r1] - sub r0, r6 - add r4d, 7 -%endif - -.loopH: - movu m2, [r0] ; m2 = src[0-7] - movu m3, [r0 + 16] ; m3 = src[8-15] - - pmaddwd m4, m2, m0 - palignr m5, m3, m2, 2 ; m5 = src[1-8] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m3, m2, 4 ; m5 = src[2-9] - pmaddwd m5, m0 - palignr m6, m3, m2, 6 ; m6 = src[3-10] - pmaddwd m6, m0 - phaddd m5, m6 - phaddd m4, m5 - paddd m4, m1 - - palignr m5, m3, m2, 8 ; m5 = src[4-11] - pmaddwd m5, m0 - palignr m6, m3, m2, 10 ; m6 = src[5-12] - pmaddwd m6, m0 - phaddd m5, m6 - - palignr m6, m3, m2, 12 ; m6 = src[6-13] - pmaddwd m6, m0 - palignr m3, m2, 14 ; m3 = src[7-14] - pmaddwd m3, m0 - phaddd m6, m3 - phaddd m5, m6 - paddd m5, m1 -%ifidn %3, pp - psrad m4, 6 - psrad m5, 6 - packusdw m4, m5 - CLIPW m4, m7, [pw_pixel_max] -%else - psrad m4, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m4, m5 -%endif - - movu [r2], m4 - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET -%endmacro - -;------------------------------------------------------------------------------------------------------------ -; void interp_8tap_horiz_pp_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------ -FILTER_HOR_LUMA_W8 8, 4, pp -FILTER_HOR_LUMA_W8 8, 8, pp -FILTER_HOR_LUMA_W8 8, 16, pp -FILTER_HOR_LUMA_W8 8, 32, pp - -;--------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;--------------------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W8 8, 4, ps -FILTER_HOR_LUMA_W8 8, 8, ps -FILTER_HOR_LUMA_W8 8, 16, ps -FILTER_HOR_LUMA_W8 8, 
32, ps - -;-------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W12 3 -INIT_XMM sse4 -cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 - - add r1, r1 - add r3, r3 - mov r4d, r4m - sub r0, 6 - shl r4d, 4 - -%ifdef PIC - lea r6, [tab_LumaCoeff] - mova m0, [r6 + r4] -%else - mova m0, [tab_LumaCoeff + r4] -%endif -%ifidn %3, pp - mova m1, [INTERP_OFFSET_PP] -%else - mova m1, [INTERP_OFFSET_PS] -%endif - - mov r4d, %2 -%ifidn %3, ps - cmp r5m, byte 0 - je .loopH - lea r6, [r1 + 2 * r1] - sub r0, r6 - add r4d, 7 -%endif - -.loopH: - movu m2, [r0] ; m2 = src[0-7] - movu m3, [r0 + 16] ; m3 = src[8-15] - - pmaddwd m4, m2, m0 - palignr m5, m3, m2, 2 ; m5 = src[1-8] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m3, m2, 4 ; m5 = src[2-9] - pmaddwd m5, m0 - palignr m6, m3, m2, 6 ; m6 = src[3-10] - pmaddwd m6, m0 - phaddd m5, m6 - phaddd m4, m5 - paddd m4, m1 - - palignr m5, m3, m2, 8 ; m5 = src[4-11] - pmaddwd m5, m0 - palignr m6, m3, m2, 10 ; m6 = src[5-12] - pmaddwd m6, m0 - phaddd m5, m6 - - palignr m6, m3, m2, 12 ; m6 = src[6-13] - pmaddwd m6, m0 - palignr m7, m3, m2, 14 ; m2 = src[7-14] - pmaddwd m7, m0 - phaddd m6, m7 - phaddd m5, m6 - paddd m5, m1 -%ifidn %3, pp - psrad m4, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m4, m5 - pxor m5, m5 - CLIPW m4, m5, [pw_pixel_max] -%else - psrad m4, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m4, m5 -%endif - - movu [r2], m4 - - movu m2, [r0 + 32] ; m2 = src[16-23] - - pmaddwd m4, m3, m0 ; m3 = src[8-15] - palignr m5, m2, m3, 2 ; m5 = src[9-16] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m2, m3, 4 ; m5 = src[10-17] - pmaddwd m5, m0 - palignr m2, m3, 6 ; m2 = src[11-18] - pmaddwd m2, m0 - phaddd m5, m2 - 
phaddd m4, m5 - paddd m4, m1 -%ifidn %3, pp - psrad m4, INTERP_SHIFT_PP - packusdw m4, m4 - pxor m5, m5 - CLIPW m4, m5, [pw_pixel_max] -%else - psrad m4, INTERP_SHIFT_PS - packssdw m4, m4 -%endif - - movh [r2 + 16], m4 - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET -%endmacro -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W12 12, 16, pp - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W12 12, 16, ps - -;-------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W16 3 -INIT_XMM sse4 -cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 - - add r1, r1 - add r3, r3 - mov r4d, r4m - sub r0, 6 - shl r4d, 4 - -%ifdef PIC - lea r6, [tab_LumaCoeff] - mova m0, [r6 + r4] -%else - mova m0, [tab_LumaCoeff + r4] -%endif - -%ifidn %3, pp - mova m1, [pd_32] -%else - mova m1, [INTERP_OFFSET_PS] -%endif - - mov r4d, %2 -%ifidn %3, ps - cmp r5m, byte 0 - je .loopH - lea r6, [r1 + 2 * r1] - sub r0, r6 - add r4d, 7 -%endif - -.loopH: -%assign x 0 -%rep %1 / 16 - movu m2, [r0 + x] ; m2 = src[0-7] - movu m3, [r0 + 16 + x] ; m3 = src[8-15] - - 
pmaddwd m4, m2, m0 - palignr m5, m3, m2, 2 ; m5 = src[1-8] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m3, m2, 4 ; m5 = src[2-9] - pmaddwd m5, m0 - palignr m6, m3, m2, 6 ; m6 = src[3-10] - pmaddwd m6, m0 - phaddd m5, m6 - phaddd m4, m5 - paddd m4, m1 - - palignr m5, m3, m2, 8 ; m5 = src[4-11] - pmaddwd m5, m0 - palignr m6, m3, m2, 10 ; m6 = src[5-12] - pmaddwd m6, m0 - phaddd m5, m6 - - palignr m6, m3, m2, 12 ; m6 = src[6-13] - pmaddwd m6, m0 - palignr m7, m3, m2, 14 ; m2 = src[7-14] - pmaddwd m7, m0 - phaddd m6, m7 - phaddd m5, m6 - paddd m5, m1 -%ifidn %3, pp - psrad m4, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m4, m5 - pxor m5, m5 - CLIPW m4, m5, [pw_pixel_max] -%else - psrad m4, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m4, m5 -%endif - movu [r2 + x], m4 - - movu m2, [r0 + 32 + x] ; m2 = src[16-23] - - pmaddwd m4, m3, m0 ; m3 = src[8-15] - palignr m5, m2, m3, 2 ; m5 = src[9-16] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m2, m3, 4 ; m5 = src[10-17] - pmaddwd m5, m0 - palignr m6, m2, m3, 6 ; m6 = src[11-18] - pmaddwd m6, m0 - phaddd m5, m6 - phaddd m4, m5 - paddd m4, m1 - - palignr m5, m2, m3, 8 ; m5 = src[12-19] - pmaddwd m5, m0 - palignr m6, m2, m3, 10 ; m6 = src[13-20] - pmaddwd m6, m0 - phaddd m5, m6 - - palignr m6, m2, m3, 12 ; m6 = src[14-21] - pmaddwd m6, m0 - palignr m2, m3, 14 ; m3 = src[15-22] - pmaddwd m2, m0 - phaddd m6, m2 - phaddd m5, m6 - paddd m5, m1 -%ifidn %3, pp - psrad m4, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m4, m5 - pxor m5, m5 - CLIPW m4, m5, [pw_pixel_max] -%else - psrad m4, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m4, m5 -%endif - movu [r2 + 16 + x], m4 - -%assign x x+32 -%endrep - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET -%endmacro - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx 
-;------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W16 16, 4, pp -FILTER_HOR_LUMA_W16 16, 8, pp -FILTER_HOR_LUMA_W16 16, 12, pp -FILTER_HOR_LUMA_W16 16, 16, pp -FILTER_HOR_LUMA_W16 16, 32, pp -FILTER_HOR_LUMA_W16 16, 64, pp - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W16 16, 4, ps -FILTER_HOR_LUMA_W16 16, 8, ps -FILTER_HOR_LUMA_W16 16, 12, ps -FILTER_HOR_LUMA_W16 16, 16, ps -FILTER_HOR_LUMA_W16 16, 32, ps -FILTER_HOR_LUMA_W16 16, 64, ps - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W16 32, 8, pp -FILTER_HOR_LUMA_W16 32, 16, pp -FILTER_HOR_LUMA_W16 32, 24, pp -FILTER_HOR_LUMA_W16 32, 32, pp -FILTER_HOR_LUMA_W16 32, 64, pp - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W16 32, 8, ps -FILTER_HOR_LUMA_W16 32, 16, ps -FILTER_HOR_LUMA_W16 32, 24, ps -FILTER_HOR_LUMA_W16 32, 32, ps -FILTER_HOR_LUMA_W16 32, 64, ps - 
-;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_48x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W16 48, 64, pp - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_48x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W16 48, 64, ps - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_64x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W16 64, 16, pp -FILTER_HOR_LUMA_W16 64, 32, pp -FILTER_HOR_LUMA_W16 64, 48, pp -FILTER_HOR_LUMA_W16 64, 64, pp - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_64x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W16 64, 16, ps -FILTER_HOR_LUMA_W16 64, 32, ps -FILTER_HOR_LUMA_W16 64, 48, ps -FILTER_HOR_LUMA_W16 64, 64, ps - -;-------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
-;-------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W24 3 -INIT_XMM sse4 -cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 - - add r1, r1 - add r3, r3 - mov r4d, r4m - sub r0, 6 - shl r4d, 4 - -%ifdef PIC - lea r6, [tab_LumaCoeff] - mova m0, [r6 + r4] -%else - mova m0, [tab_LumaCoeff + r4] -%endif -%ifidn %3, pp - mova m1, [pd_32] -%else - mova m1, [INTERP_OFFSET_PS] -%endif - - mov r4d, %2 -%ifidn %3, ps - cmp r5m, byte 0 - je .loopH - lea r6, [r1 + 2 * r1] - sub r0, r6 - add r4d, 7 -%endif - -.loopH: - movu m2, [r0] ; m2 = src[0-7] - movu m3, [r0 + 16] ; m3 = src[8-15] - - pmaddwd m4, m2, m0 - palignr m5, m3, m2, 2 ; m5 = src[1-8] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m3, m2, 4 ; m5 = src[2-9] - pmaddwd m5, m0 - palignr m6, m3, m2, 6 ; m6 = src[3-10] - pmaddwd m6, m0 - phaddd m5, m6 - phaddd m4, m5 - paddd m4, m1 - - palignr m5, m3, m2, 8 ; m5 = src[4-11] - pmaddwd m5, m0 - palignr m6, m3, m2, 10 ; m6 = src[5-12] - pmaddwd m6, m0 - phaddd m5, m6 - - palignr m6, m3, m2, 12 ; m6 = src[6-13] - pmaddwd m6, m0 - palignr m7, m3, m2, 14 ; m7 = src[7-14] - pmaddwd m7, m0 - phaddd m6, m7 - phaddd m5, m6 - paddd m5, m1 -%ifidn %3, pp - psrad m4, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m4, m5 - pxor m5, m5 - CLIPW m4, m5, [pw_pixel_max] -%else - psrad m4, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m4, m5 -%endif - movu [r2], m4 - - movu m2, [r0 + 32] ; m2 = src[16-23] - - pmaddwd m4, m3, m0 ; m3 = src[8-15] - palignr m5, m2, m3, 2 ; m5 = src[1-8] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m2, m3, 4 ; m5 = src[2-9] - pmaddwd m5, m0 - palignr m6, m2, m3, 6 ; m6 = src[3-10] - pmaddwd m6, m0 - phaddd m5, m6 - phaddd m4, m5 - paddd m4, m1 - - palignr m5, m2, m3, 8 ; m5 = src[4-11] - pmaddwd m5, m0 - palignr m6, m2, m3, 10 ; m6 = src[5-12] - pmaddwd m6, m0 - phaddd m5, m6 - - palignr m6, m2, m3, 12 ; m6 = src[6-13] - pmaddwd m6, m0 - palignr m7, m2, m3, 14 ; 
m7 = src[7-14] - pmaddwd m7, m0 - phaddd m6, m7 - phaddd m5, m6 - paddd m5, m1 -%ifidn %3, pp - psrad m4, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m4, m5 - pxor m5, m5 - CLIPW m4, m5, [pw_pixel_max] -%else - psrad m4, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m4, m5 -%endif - movu [r2 + 16], m4 - - movu m3, [r0 + 48] ; m3 = src[24-31] - - pmaddwd m4, m2, m0 ; m2 = src[16-23] - palignr m5, m3, m2, 2 ; m5 = src[1-8] - pmaddwd m5, m0 - phaddd m4, m5 - - palignr m5, m3, m2, 4 ; m5 = src[2-9] - pmaddwd m5, m0 - palignr m6, m3, m2, 6 ; m6 = src[3-10] - pmaddwd m6, m0 - phaddd m5, m6 - phaddd m4, m5 - paddd m4, m1 - - palignr m5, m3, m2, 8 ; m5 = src[4-11] - pmaddwd m5, m0 - palignr m6, m3, m2, 10 ; m6 = src[5-12] - pmaddwd m6, m0 - phaddd m5, m6 - - palignr m6, m3, m2, 12 ; m6 = src[6-13] - pmaddwd m6, m0 - palignr m7, m3, m2, 14 ; m7 = src[7-14] - pmaddwd m7, m0 - phaddd m6, m7 - phaddd m5, m6 - paddd m5, m1 -%ifidn %3, pp - psrad m4, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m4, m5 - pxor m5, m5 - CLIPW m4, m5, [pw_pixel_max] -%else - psrad m4, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m4, m5 -%endif - movu [r2 + 32], m4 - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET -%endmacro - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_24x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W24 24, 32, pp - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_24x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) 
-;---------------------------------------------------------------------------------------------------------------------------- -FILTER_HOR_LUMA_W24 24, 32, ps - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W4_avx2 1 -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_4x%1, 4,7,7 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 -%ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - lea r6, [pw_pixel_max] - mova m3, [interp8_hpp_shuf] - mova m6, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, %1/2 - -.loop: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - phaddd m4, m4 - vpermq m4, m4, q3120 - paddd m4, m6 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [r6] - movq [r2], xm4 - - vbroadcasti128 m4, [r0 + r1] - vbroadcasti128 m5, [r0 + r1 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - phaddd m4, m4 - vpermq m4, m4, q3120 - paddd m4, m6 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [r6] - movq [r2 + r3], xm4 - - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - dec r4d - jnz .loop - RET -%endmacro -FILTER_HOR_LUMA_W4_avx2 4 -FILTER_HOR_LUMA_W4_avx2 8 -FILTER_HOR_LUMA_W4_avx2 16 - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, 
intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W8 1 -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_8x%1, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 -%ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [h_ab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, %1/2 - -.loop: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8] - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + r1] - vbroadcasti128 m5, [r0 + r1 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + r1 + 8] - vbroadcasti128 m6, [r0 + r1 + 16] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + r3], xm4 - - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - dec r4d - jnz .loop - RET -%endmacro -FILTER_HOR_LUMA_W8 4 -FILTER_HOR_LUMA_W8 8 -FILTER_HOR_LUMA_W8 16 -FILTER_HOR_LUMA_W8 32 - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx 
-;------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W16 1 -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_16x%1, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 -%ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, %1 - -.loop: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8] - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 24] - vbroadcasti128 m6, [r0 + 32] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 16], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET -%endmacro -FILTER_HOR_LUMA_W16 4 -FILTER_HOR_LUMA_W16 8 -FILTER_HOR_LUMA_W16 12 -FILTER_HOR_LUMA_W16 16 -FILTER_HOR_LUMA_W16 32 -FILTER_HOR_LUMA_W16 64 - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx 
-;------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W32 2 -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 -%ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, %2 - -.loop: -%assign x 0 -%rep %1/16 - vbroadcasti128 m4, [r0 + x] - vbroadcasti128 m5, [r0 + 8 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8 + x] - vbroadcasti128 m6, [r0 + 16 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + x], xm4 - - vbroadcasti128 m4, [r0 + 16 + x] - vbroadcasti128 m5, [r0 + 24 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 24 + x] - vbroadcasti128 m6, [r0 + 32 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 16 + x], xm4 - -%assign x x+32 -%endrep - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET -%endmacro -FILTER_HOR_LUMA_W32 32, 8 -FILTER_HOR_LUMA_W32 32, 16 -FILTER_HOR_LUMA_W32 32, 24 -FILTER_HOR_LUMA_W32 32, 32 -FILTER_HOR_LUMA_W32 32, 64 -FILTER_HOR_LUMA_W32 64, 16 -FILTER_HOR_LUMA_W32 64, 32 -FILTER_HOR_LUMA_W32 64, 48 -FILTER_HOR_LUMA_W32 64, 64 - 
-;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_12x16, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 -%ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, 16 - -.loop: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8] - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 24] - vbroadcasti128 m6, [r0 + 32] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movq [r2 + 16], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int 
coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_24x32, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 -%ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, 32 - -.loop: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8] - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 24] - vbroadcasti128 m6, [r0 + 32] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 16], xm4 - - vbroadcasti128 m4, [r0 + 32] - vbroadcasti128 m5, [r0 + 40] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 40] - vbroadcasti128 m6, [r0 + 48] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 32], xm4 - - 
add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET - -;------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_48x64, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 -%ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, 64 - -.loop: -%assign x 0 -%rep 2 - vbroadcasti128 m4, [r0 + x] - vbroadcasti128 m5, [r0 + 8 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8 + x] - vbroadcasti128 m6, [r0 + 16 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + x], xm4 - - vbroadcasti128 m4, [r0 + 16 + x] - vbroadcasti128 m5, [r0 + 24 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 24 + x] - vbroadcasti128 m6, [r0 + 32 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 16 + x], xm4 - - vbroadcasti128 m4, [r0 + 32 + x] - vbroadcasti128 m5, [r0 + 40 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 
- - vbroadcasti128 m5, [r0 + 40 + x] - vbroadcasti128 m6, [r0 + 48 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 32 + x], xm4 - -%assign x x+48 -%endrep - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET - -;----------------------------------------------------------------------------------------------------------------------------- -;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- - -%macro IPFILTER_LUMA_PS_4xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_4x%1, 6,8,7 - mov r5d, r5m - mov r4d, r4m - add r1d, r1d - add r3d, r3d - -%ifdef PIC - lea r6, [tab_LumaCoeff] - lea r4, [r4 * 8] - vbroadcasti128 m0, [r6 + r4 * 2] -%else - lea r4, [r4 * 8] - vbroadcasti128 m0, [tab_LumaCoeff + r4 * 2] -%endif - - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - pw_2000 - - sub r0, 6 - test r5d, r5d - mov r7d, %1 ; loop count variable - height - jz .preloop - lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride - sub r0, r6 ; r0(src) - 3 * srcStride - add r7d, 6 ;7 - 1(since last row not in loop) ; need extra 7 rows, just set a specially flag here, blkheight += N - 1 (7 - 3 = 4 ; since the last three rows not in loop) - -.preloop: - lea r6, [r3 * 3] -.loop: - ; Row 0 - movu xm3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - movu xm4, [r0 + 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - vinserti128 m3, m3, xm4, 1 - movu xm4, [r0 + 4] - movu xm5, [r0 + 6] - vinserti128 m4, m4, xm5, 1 - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] - - ; 
Row 1 - movu xm4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - movu xm5, [r0 + r1 + 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - vinserti128 m4, m4, xm5, 1 - movu xm5, [r0 + r1 + 4] - movu xm6, [r0 + r1 + 6] - vinserti128 m5, m5, xm6, 1 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] - phaddd m3, m4 ; all rows and col completed. - - mova m5, [interp8_hps_shuf] - vpermd m3, m5, m3 - paddd m3, m2 - vextracti128 xm4, m3, 1 - psrad xm3, INTERP_SHIFT_PS - psrad xm4, INTERP_SHIFT_PS - packssdw xm3, xm3 - packssdw xm4, xm4 - - movq [r2], xm3 ;row 0 - movq [r2 + r3], xm4 ;row 1 - lea r0, [r0 + r1 * 2] ; first loop src ->5th row(i.e 4) - lea r2, [r2 + r3 * 2] ; first loop dst ->5th row(i.e 4) - - sub r7d, 2 - jg .loop - test r5d, r5d - jz .end - - ; Row 10 - movu xm3, [r0] - movu xm4, [r0 + 2] - vinserti128 m3, m3, xm4, 1 - movu xm4, [r0 + 4] - movu xm5, [r0 + 6] - vinserti128 m4, m4, xm5, 1 - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - - ; Row11 - phaddd m3, m4 ; all rows and col completed. 
- - mova m5, [interp8_hps_shuf] - vpermd m3, m5, m3 - paddd m3, m2 - vextracti128 xm4, m3, 1 - psrad xm3, INTERP_SHIFT_PS - psrad xm4, INTERP_SHIFT_PS - packssdw xm3, xm3 - packssdw xm4, xm4 - - movq [r2], xm3 ;row 0 -.end: - RET -%endif -%endmacro - - IPFILTER_LUMA_PS_4xN_AVX2 4 - IPFILTER_LUMA_PS_4xN_AVX2 8 - IPFILTER_LUMA_PS_4xN_AVX2 16 - -%macro IPFILTER_LUMA_PS_8xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_8x%1, 4, 6, 8 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - shl r4d, 4 -%ifdef PIC - lea r6, [tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4] - vpbroadcastq m1, [r6 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 6 - test r5d, r5d - mov r4d, %1 - jz .loop0 - lea r6, [r1*3] - sub r0, r6 - add r4d, 7 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m7, m5, m3 - pmaddwd m4, m0 - pmaddwd m7, m1 - paddd m4, m7 - - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m6, m3 - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m2 - vextracti128 xm5,m4, 1 - psrad xm4, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm4, xm5 - - movu [r2], xm4 - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - - IPFILTER_LUMA_PS_8xN_AVX2 4 - IPFILTER_LUMA_PS_8xN_AVX2 8 - IPFILTER_LUMA_PS_8xN_AVX2 16 - IPFILTER_LUMA_PS_8xN_AVX2 32 - -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_24x32, 4, 6, 8 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - shl r4d, 4 -%ifdef PIC - lea r6, [tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4] - vpbroadcastq m1, [r6 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - vbroadcasti128 m2, 
[INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 6 - test r5d, r5d - mov r4d, 32 - jz .loop0 - lea r6, [r1*3] - sub r0, r6 - add r4d, 7 - -.loop0: -%assign x 0 -%rep 24/8 - vbroadcasti128 m4, [r0 + x] - vbroadcasti128 m5, [r0 + 8 + x] - pshufb m4, m3 - pshufb m7, m5, m3 - pmaddwd m4, m0 - pmaddwd m7, m1 - paddd m4, m7 - - vbroadcasti128 m6, [r0 + 16 + x] - pshufb m5, m3 - pshufb m6, m3 - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m2 - vextracti128 xm5,m4, 1 - psrad xm4, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm4, xm5 - - movu [r2 + x], xm4 - %assign x x+16 - %endrep - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%macro IPFILTER_LUMA_PS_32_64_AVX2 2 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_%1x%2, 4, 6, 8 - - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - shl r4d, 6 -%ifdef PIC - lea r6, [h_tab_LumaCoeffV] - movu m0, [r6 + r4] - movu m1, [r6 + r4 + mmsize] -%else - movu m0, [h_tab_LumaCoeffV + r4] - movu m1, [h_tab_LumaCoeffV + r4 + mmsize] -%endif - mova m3, [interp8_hpp_shuf_new] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 6 - test r5d, r5d - mov r4d, %2 - jz .loop0 - lea r6, [r1*3] - sub r0, r6 - add r4d, 7 - -.loop0: -%assign x 0 -%rep %1/16 - vbroadcasti128 m4, [r0 + x] - vbroadcasti128 m5, [r0 + 4 * SIZEOF_PIXEL + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m7, m5, m1 - paddd m4, m7 - vextracti128 xm7, m4, 1 - paddd xm4, xm7 - paddd xm4, xm2 - psrad xm4, INTERP_SHIFT_PS - - vbroadcasti128 m6, [r0 + 16 + x] - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m7, m6, m1 - paddd m5, m7 - vextracti128 xm7, m5, 1 - paddd xm5, xm7 - paddd xm5, xm2 - psrad xm5, INTERP_SHIFT_PS - - packssdw xm4, xm5 - movu [r2 + x], xm4 - - vbroadcasti128 m5, [r0 + 24 + x] - pshufb m5, m3 - - pmaddwd m6, m0 - pmaddwd m7, m5, m1 - paddd m6, m7 - vextracti128 
xm7, m6, 1 - paddd xm6, xm7 - paddd xm6, xm2 - psrad xm6, INTERP_SHIFT_PS - - vbroadcasti128 m7, [r0 + 32 + x] - pshufb m7, m3 - - pmaddwd m5, m0 - pmaddwd m7, m1 - paddd m5, m7 - vextracti128 xm7, m5, 1 - paddd xm5, xm7 - paddd xm5, xm2 - psrad xm5, INTERP_SHIFT_PS - - packssdw xm6, xm5 - movu [r2 + 16 + x], xm6 - -%assign x x+32 -%endrep - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - - IPFILTER_LUMA_PS_32_64_AVX2 32, 8 - IPFILTER_LUMA_PS_32_64_AVX2 32, 16 - IPFILTER_LUMA_PS_32_64_AVX2 32, 24 - IPFILTER_LUMA_PS_32_64_AVX2 32, 32 - IPFILTER_LUMA_PS_32_64_AVX2 32, 64 - - IPFILTER_LUMA_PS_32_64_AVX2 64, 16 - IPFILTER_LUMA_PS_32_64_AVX2 64, 32 - IPFILTER_LUMA_PS_32_64_AVX2 64, 48 - IPFILTER_LUMA_PS_32_64_AVX2 64, 64 - - IPFILTER_LUMA_PS_32_64_AVX2 48, 64 - -%macro IPFILTER_LUMA_PS_16xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_16x%1, 4, 6, 8 - - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - shl r4d, 4 -%ifdef PIC - lea r6, [tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4] - vpbroadcastq m1, [r6 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 6 - test r5d, r5d - mov r4d, %1 - jz .loop0 - lea r6, [r1*3] - sub r0, r6 - add r4d, 7 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m7, m5, m3 - pmaddwd m4, m0 - pmaddwd m7, m1 - paddd m4, m7 - - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m7, m6, m3 - pmaddwd m5, m0 - pmaddwd m7, m1 - paddd m5, m7 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m2 - vextracti128 xm5, m4, 1 - psrad xm4, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm4, xm5 - movu [r2], xm4 - - vbroadcasti128 m5, [r0 + 24] - pshufb m6, m3 - pshufb m7, m5, m3 - pmaddwd m6, m0 - pmaddwd m7, m1 - paddd m6, m7 - - vbroadcasti128 m7, [r0 + 
32] - pshufb m5, m3 - pshufb m7, m3 - pmaddwd m5, m0 - pmaddwd m7, m1 - paddd m5, m7 - - phaddd m6, m5 - vpermq m6, m6, q3120 - paddd m6, m2 - vextracti128 xm5,m6, 1 - psrad xm6, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm6, xm5 - movu [r2 + 16], xm6 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - - IPFILTER_LUMA_PS_16xN_AVX2 4 - IPFILTER_LUMA_PS_16xN_AVX2 8 - IPFILTER_LUMA_PS_16xN_AVX2 12 - IPFILTER_LUMA_PS_16xN_AVX2 16 - IPFILTER_LUMA_PS_16xN_AVX2 32 - IPFILTER_LUMA_PS_16xN_AVX2 64 - -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_12x16, 4, 6, 8 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - shl r4d, 4 -%ifdef PIC - lea r6, [tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4] - vpbroadcastq m1, [r6 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 6 - test r5d, r5d - mov r4d, 16 - jz .loop0 - lea r6, [r1*3] - sub r0, r6 - add r4d, 7 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m7, m5, m3 - pmaddwd m4, m0 - pmaddwd m7, m1 - paddd m4, m7 - - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m7, m6, m3 - pmaddwd m5, m0 - pmaddwd m7, m1 - paddd m5, m7 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m2 - vextracti128 xm5,m4, 1 - psrad xm4, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm4, xm5 - movu [r2], xm4 - - vbroadcasti128 m5, [r0 + 24] - pshufb m6, m3 - pshufb m5, m3 - pmaddwd m6, m0 - pmaddwd m5, m1 - paddd m6, m5 - - phaddd m6, m6 - vpermq m6, m6, q3120 - paddd xm6, xm2 - psrad xm6, INTERP_SHIFT_PS - packssdw xm6, xm6 - movq [r2 + 16], xm6 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif
x265_2.7.tar.gz/source/common/x86/h-ipfilter8.asm
Deleted
@@ -1,6736 +0,0 @@
-;*****************************************************************************
-;* Copyright (C) 2013-2017 MulticoreWare, Inc
-;*
-;* Authors: Min Chen <chenm003@163.com>
-;*          Nabajit Deka <nabajit@multicorewareinc.com>
-;*          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
-;*
-;* This program is free software; you can redistribute it and/or modify
-;* it under the terms of the GNU General Public License as published by
-;* the Free Software Foundation; either version 2 of the License, or
-;* (at your option) any later version.
-;*
-;* This program is distributed in the hope that it will be useful,
-;* but WITHOUT ANY WARRANTY; without even the implied warranty of
-;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-;* GNU General Public License for more details.
-;*
-;* You should have received a copy of the GNU General Public License
-;* along with this program; if not, write to the Free Software
-;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
-;*
-;* This program is also available under a commercial proprietary license.
-;* For more information, contact us at license @ x265.com.
-;*****************************************************************************/ - -%include "x86inc.asm" -%include "x86util.asm" - -SECTION_RODATA 32 - -const h_tabw_LumaCoeff, dw 0, 0, 0, 64, 0, 0, 0, 0 - dw -1, 4, -10, 58, 17, -5, 1, 0 - dw -1, 4, -11, 40, 40, -11, 4, -1 - dw 0, 1, -5, 17, 58, -10, 4, -1 - -const h_tabw_ChromaCoeff, dw 0, 64, 0, 0 - dw -2, 58, 10, -2 - dw -4, 54, 16, -2 - dw -6, 46, 28, -4 - dw -4, 36, 36, -4 - dw -4, 28, 46, -6 - dw -2, 16, 54, -4 - dw -2, 10, 58, -2 - -const h_tab_ChromaCoeff, db 0, 64, 0, 0 - db -2, 58, 10, -2 - db -4, 54, 16, -2 - db -6, 46, 28, -4 - db -4, 36, 36, -4 - db -4, 28, 46, -6 - db -2, 16, 54, -4 - db -2, 10, 58, -2 - -const h_tab_LumaCoeff, db 0, 0, 0, 64, 0, 0, 0, 0 - db -1, 4, -10, 58, 17, -5, 1, 0 - db -1, 4, -11, 40, 40, -11, 4, -1 - db 0, 1, -5, 17, 58, -10, 4, -1 - -const h_tab_Tm, db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 - db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 - db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14 - - -const h_tab_Lm, db 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8 - db 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10 - db 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12 - db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14 - -const interp4_shuf, times 2 db 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15 - -const h_interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 - -const interp4_hpp_shuf, times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 - -const interp4_horiz_shuf1, db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 - db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 - -const h_pd_526336, times 8 dd 8192*64+2048 - -const pb_LumaCoeffVer, times 16 db 0, 0 - times 16 db 0, 64 - times 16 db 0, 0 - times 16 db 0, 0 - - times 16 db -1, 4 - times 16 db -10, 58 - times 16 db 17, -5 - times 16 db 1, 0 - - times 16 db -1, 4 - times 16 db -11, 40 - times 16 db 40, -11 - times 16 db 4, -1 - - times 16 db 0, 1 - times 16 db -5, 17 - times 16 db 
58, -10 - times 16 db 4, -1 - -const h_pw_LumaCoeffVer, times 8 dw 0, 0 - times 8 dw 0, 64 - times 8 dw 0, 0 - times 8 dw 0, 0 - - times 8 dw -1, 4 - times 8 dw -10, 58 - times 8 dw 17, -5 - times 8 dw 1, 0 - - times 8 dw -1, 4 - times 8 dw -11, 40 - times 8 dw 40, -11 - times 8 dw 4, -1 - - times 8 dw 0, 1 - times 8 dw -5, 17 - times 8 dw 58, -10 - times 8 dw 4, -1 - -const pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 - times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10 - times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12 - times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14 - -ALIGN 32 -interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 - -SECTION .text - -cextern pw_1 -cextern pw_32 -cextern pw_2000 -cextern pw_512 - -%macro PROCESS_LUMA_AVX2_W8_16R 1 - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - paddd m0, m4 - pmaddwd m2, [r5] - lea r7, [r0 + r1 * 4] - movu xm4, [r7] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - paddd m1, m5 - pmaddwd m3, [r5] - movu xm5, [r7 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 2 * mmsize] - paddd m0, m6 - pmaddwd m6, m4, [r5 + 1 * mmsize] - paddd m2, m6 - pmaddwd m4, [r5] - movu xm6, [r7 + r1 * 2] ; m6 = row 6 - punpckhwd xm7, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddwd m7, m5, [r5 + 2 * mmsize] - paddd m1, m7 - pmaddwd m7, m5, [r5 + 1 * mmsize] - paddd m3, m7 - pmaddwd m5, [r5] - movu xm7, 
[r7 + r4] ; m7 = row 7 - punpckhwd xm8, xm6, xm7 - punpcklwd xm6, xm7 - vinserti128 m6, m6, xm8, 1 - pmaddwd m8, m6, [r5 + 3 * mmsize] - paddd m0, m8 - pmaddwd m8, m6, [r5 + 2 * mmsize] - paddd m2, m8 - pmaddwd m8, m6, [r5 + 1 * mmsize] - paddd m4, m8 - pmaddwd m6, [r5] - lea r7, [r7 + r1 * 4] - movu xm8, [r7] ; m8 = row 8 - punpckhwd xm9, xm7, xm8 - punpcklwd xm7, xm8 - vinserti128 m7, m7, xm9, 1 - pmaddwd m9, m7, [r5 + 3 * mmsize] - paddd m1, m9 - pmaddwd m9, m7, [r5 + 2 * mmsize] - paddd m3, m9 - pmaddwd m9, m7, [r5 + 1 * mmsize] - paddd m5, m9 - pmaddwd m7, [r5] - movu xm9, [r7 + r1] ; m9 = row 9 - punpckhwd xm10, xm8, xm9 - punpcklwd xm8, xm9 - vinserti128 m8, m8, xm10, 1 - pmaddwd m10, m8, [r5 + 3 * mmsize] - paddd m2, m10 - pmaddwd m10, m8, [r5 + 2 * mmsize] - paddd m4, m10 - pmaddwd m10, m8, [r5 + 1 * mmsize] - paddd m6, m10 - pmaddwd m8, [r5] - movu xm10, [r7 + r1 * 2] ; m10 = row 10 - punpckhwd xm11, xm9, xm10 - punpcklwd xm9, xm10 - vinserti128 m9, m9, xm11, 1 - pmaddwd m11, m9, [r5 + 3 * mmsize] - paddd m3, m11 - pmaddwd m11, m9, [r5 + 2 * mmsize] - paddd m5, m11 - pmaddwd m11, m9, [r5 + 1 * mmsize] - paddd m7, m11 - pmaddwd m9, [r5] - movu xm11, [r7 + r4] ; m11 = row 11 - punpckhwd xm12, xm10, xm11 - punpcklwd xm10, xm11 - vinserti128 m10, m10, xm12, 1 - pmaddwd m12, m10, [r5 + 3 * mmsize] - paddd m4, m12 - pmaddwd m12, m10, [r5 + 2 * mmsize] - paddd m6, m12 - pmaddwd m12, m10, [r5 + 1 * mmsize] - paddd m8, m12 - pmaddwd m10, [r5] - lea r7, [r7 + r1 * 4] - movu xm12, [r7] ; m12 = row 12 - punpckhwd xm13, xm11, xm12 - punpcklwd xm11, xm12 - vinserti128 m11, m11, xm13, 1 - pmaddwd m13, m11, [r5 + 3 * mmsize] - paddd m5, m13 - pmaddwd m13, m11, [r5 + 2 * mmsize] - paddd m7, m13 - pmaddwd m13, m11, [r5 + 1 * mmsize] - paddd m9, m13 - pmaddwd m11, [r5] - -%ifidn %1,sp - paddd m0, m14 - paddd m1, m14 - paddd m2, m14 - paddd m3, m14 - paddd m4, m14 - paddd m5, m14 - psrad m0, 12 - psrad m1, 12 - psrad m2, 12 - psrad m3, 12 - psrad m4, 12 - psrad m5, 12 -%else 
- psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - psrad m4, 6 - psrad m5, 6 -%endif - packssdw m0, m1 - packssdw m2, m3 - packssdw m4, m5 -%ifidn %1,sp - packuswb m0, m2 - mova m5, [h_interp8_hps_shuf] - vpermd m0, m5, m0 - vextracti128 xm2, m0, 1 - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r6], xm2 -%else - vpermq m0, m0, 11011000b - vpermq m2, m2, 11011000b - vextracti128 xm1, m0, 1 - vextracti128 xm3, m2, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm3 -%endif - - movu xm13, [r7 + r1] ; m13 = row 13 - punpckhwd xm0, xm12, xm13 - punpcklwd xm12, xm13 - vinserti128 m12, m12, xm0, 1 - pmaddwd m0, m12, [r5 + 3 * mmsize] - paddd m6, m0 - pmaddwd m0, m12, [r5 + 2 * mmsize] - paddd m8, m0 - pmaddwd m0, m12, [r5 + 1 * mmsize] - paddd m10, m0 - pmaddwd m12, [r5] - movu xm0, [r7 + r1 * 2] ; m0 = row 14 - punpckhwd xm1, xm13, xm0 - punpcklwd xm13, xm0 - vinserti128 m13, m13, xm1, 1 - pmaddwd m1, m13, [r5 + 3 * mmsize] - paddd m7, m1 - pmaddwd m1, m13, [r5 + 2 * mmsize] - paddd m9, m1 - pmaddwd m1, m13, [r5 + 1 * mmsize] - paddd m11, m1 - pmaddwd m13, [r5] - -%ifidn %1,sp - paddd m6, m14 - paddd m7, m14 - psrad m6, 12 - psrad m7, 12 -%else - psrad m6, 6 - psrad m7, 6 -%endif - packssdw m6, m7 - lea r8, [r2 + r3 * 4] - -%ifidn %1,sp - packuswb m4, m6 - vpermd m4, m5, m4 - vextracti128 xm6, m4, 1 - movq [r8], xm4 - movhps [r8 + r3], xm4 - movq [r8 + r3 * 2], xm6 - movhps [r8 + r6], xm6 -%else - vpermq m4, m4, 11011000b - vpermq m6, m6, 11011000b - vextracti128 xm1, m4, 1 - vextracti128 xm7, m6, 1 - movu [r8], xm4 - movu [r8 + r3], xm1 - movu [r8 + r3 * 2], xm6 - movu [r8 + r6], xm7 -%endif - - movu xm1, [r7 + r4] ; m1 = row 15 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m2, m0, [r5 + 3 * mmsize] - paddd m8, m2 - pmaddwd m2, m0, [r5 + 2 * mmsize] - paddd m10, m2 - pmaddwd m2, m0, [r5 + 1 * mmsize] - paddd m12, m2 - pmaddwd m0, [r5] - lea r7, [r7 + r1 
* 4]
-    movu xm2, [r7] ; m2 = row 16
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m3, m1, [r5 + 3 * mmsize]
-    paddd m9, m3
-    pmaddwd m3, m1, [r5 + 2 * mmsize]
-    paddd m11, m3
-    pmaddwd m3, m1, [r5 + 1 * mmsize]
-    paddd m13, m3
-    pmaddwd m1, [r5]
-    movu xm3, [r7 + r1] ; m3 = row 17
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m4, m2, [r5 + 3 * mmsize]
-    paddd m10, m4
-    pmaddwd m4, m2, [r5 + 2 * mmsize]
-    paddd m12, m4
-    pmaddwd m2, [r5 + 1 * mmsize]
-    paddd m0, m2
-    movu xm4, [r7 + r1 * 2] ; m4 = row 18
-    punpckhwd xm2, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm2, 1
-    pmaddwd m2, m3, [r5 + 3 * mmsize]
-    paddd m11, m2
-    pmaddwd m2, m3, [r5 + 2 * mmsize]
-    paddd m13, m2
-    pmaddwd m3, [r5 + 1 * mmsize]
-    paddd m1, m3
-    movu xm2, [r7 + r4] ; m2 = row 19
-    punpckhwd xm6, xm4, xm2
-    punpcklwd xm4, xm2
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m6, m4, [r5 + 3 * mmsize]
-    paddd m12, m6
-    pmaddwd m4, [r5 + 2 * mmsize]
-    paddd m0, m4
-    lea r7, [r7 + r1 * 4]
-    movu xm6, [r7] ; m6 = row 20
-    punpckhwd xm7, xm2, xm6
-    punpcklwd xm2, xm6
-    vinserti128 m2, m2, xm7, 1
-    pmaddwd m7, m2, [r5 + 3 * mmsize]
-    paddd m13, m7
-    pmaddwd m2, [r5 + 2 * mmsize]
-    paddd m1, m2
-    movu xm7, [r7 + r1] ; m7 = row 21
-    punpckhwd xm2, xm6, xm7
-    punpcklwd xm6, xm7
-    vinserti128 m6, m6, xm2, 1
-    pmaddwd m6, [r5 + 3 * mmsize]
-    paddd m0, m6
-    movu xm2, [r7 + r1 * 2] ; m2 = row 22
-    punpckhwd xm3, xm7, xm2
-    punpcklwd xm7, xm2
-    vinserti128 m7, m7, xm3, 1
-    pmaddwd m7, [r5 + 3 * mmsize]
-    paddd m1, m7
-
-%ifidn %1,sp
-    paddd m8, m14
-    paddd m9, m14
-    paddd m10, m14
-    paddd m11, m14
-    paddd m12, m14
-    paddd m13, m14
-    paddd m0, m14
-    paddd m1, m14
-    psrad m8, 12
-    psrad m9, 12
-    psrad m10, 12
-    psrad m11, 12
-    psrad m12, 12
-    psrad m13, 12
-    psrad m0, 12
-    psrad m1, 12
-%else
-    psrad m8, 6
-    psrad m9, 6
-    psrad m10, 6
-    psrad m11, 6
-    psrad m12, 6
-    psrad m13, 6
-    psrad m0, 6
-    psrad m1, 6
-%endif
-    packssdw m8, m9
- packssdw m10, m11 - packssdw m12, m13 - packssdw m0, m1 - lea r8, [r8 + r3 * 4] - -%ifidn %1,sp - packuswb m8, m10 - packuswb m12, m0 - vpermd m8, m5, m8 - vpermd m12, m5, m12 - vextracti128 xm10, m8, 1 - vextracti128 xm0, m12, 1 - movq [r8], xm8 - movhps [r8 + r3], xm8 - movq [r8 + r3 * 2], xm10 - movhps [r8 + r6], xm10 - lea r8, [r8 + r3 * 4] - movq [r8], xm12 - movhps [r8 + r3], xm12 - movq [r8 + r3 * 2], xm0 - movhps [r8 + r6], xm0 -%else - vpermq m8, m8, 11011000b - vpermq m10, m10, 11011000b - vpermq m12, m12, 11011000b - vpermq m0, m0, 11011000b - vextracti128 xm9, m8, 1 - vextracti128 xm11, m10, 1 - vextracti128 xm13, m12, 1 - vextracti128 xm1, m0, 1 - movu [r8], xm8 - movu [r8 + r3], xm9 - movu [r8 + r3 * 2], xm10 - movu [r8 + r6], xm11 - lea r8, [r8 + r3 * 4] - movu [r8], xm12 - movu [r8 + r3], xm13 - movu [r8 + r3 * 2], xm0 - movu [r8 + r6], xm1 -%endif -%endmacro - -%macro FILTER_H8_W8_sse2 0 - movh m1, [r0 + x - 3] - movh m4, [r0 + x - 2] - punpcklbw m1, m6 - punpcklbw m4, m6 - movh m5, [r0 + x - 1] - movh m0, [r0 + x] - punpcklbw m5, m6 - punpcklbw m0, m6 - pmaddwd m1, m3 - pmaddwd m4, m3 - pmaddwd m5, m3 - pmaddwd m0, m3 - packssdw m1, m4 - packssdw m5, m0 - pshuflw m4, m1, q2301 - pshufhw m4, m4, q2301 - pshuflw m0, m5, q2301 - pshufhw m0, m0, q2301 - paddw m1, m4 - paddw m5, m0 - psrldq m1, 2 - psrldq m5, 2 - pshufd m1, m1, q3120 - pshufd m5, m5, q3120 - punpcklqdq m1, m5 - movh m7, [r0 + x + 1] - movh m4, [r0 + x + 2] - punpcklbw m7, m6 - punpcklbw m4, m6 - movh m5, [r0 + x + 3] - movh m0, [r0 + x + 4] - punpcklbw m5, m6 - punpcklbw m0, m6 - pmaddwd m7, m3 - pmaddwd m4, m3 - pmaddwd m5, m3 - pmaddwd m0, m3 - packssdw m7, m4 - packssdw m5, m0 - pshuflw m4, m7, q2301 - pshufhw m4, m4, q2301 - pshuflw m0, m5, q2301 - pshufhw m0, m0, q2301 - paddw m7, m4 - paddw m5, m0 - psrldq m7, 2 - psrldq m5, 2 - pshufd m7, m7, q3120 - pshufd m5, m5, q3120 - punpcklqdq m7, m5 - pshuflw m4, m1, q2301 - pshufhw m4, m4, q2301 - pshuflw m0, m7, q2301 - pshufhw m0, 
m0, q2301 - paddw m1, m4 - paddw m7, m0 - psrldq m1, 2 - psrldq m7, 2 - pshufd m1, m1, q3120 - pshufd m7, m7, q3120 - punpcklqdq m1, m7 -%endmacro - -%macro FILTER_H8_W4_sse2 0 - movh m1, [r0 + x - 3] - movh m0, [r0 + x - 2] - punpcklbw m1, m6 - punpcklbw m0, m6 - movh m4, [r0 + x - 1] - movh m5, [r0 + x] - punpcklbw m4, m6 - punpcklbw m5, m6 - pmaddwd m1, m3 - pmaddwd m0, m3 - pmaddwd m4, m3 - pmaddwd m5, m3 - packssdw m1, m0 - packssdw m4, m5 - pshuflw m0, m1, q2301 - pshufhw m0, m0, q2301 - pshuflw m5, m4, q2301 - pshufhw m5, m5, q2301 - paddw m1, m0 - paddw m4, m5 - psrldq m1, 2 - psrldq m4, 2 - pshufd m1, m1, q3120 - pshufd m4, m4, q3120 - punpcklqdq m1, m4 - pshuflw m0, m1, q2301 - pshufhw m0, m0, q2301 - paddw m1, m0 - psrldq m1, 2 - pshufd m1, m1, q3120 -%endmacro - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_LUMA_sse2 3 -INIT_XMM sse2 -cglobal interp_8tap_horiz_%3_%1x%2, 4,6,8 - mov r4d, r4m - add r4d, r4d - pxor m6, m6 - -%ifidn %3, ps - add r3d, r3d - cmp r5m, byte 0 -%endif - -%ifdef PIC - lea r5, [h_tabw_LumaCoeff] - movu m3, [r5 + r4 * 8] -%else - movu m3, [h_tabw_LumaCoeff + r4 * 8] -%endif - - mov r4d, %2 - -%ifidn %3, pp - mova m2, [pw_32] -%else - mova m2, [pw_2000] - je .loopH - lea r5, [r1 + 2 * r1] - sub r0, r5 - add r4d, 7 -%endif - -.loopH: -%assign x 0 -%rep %1 / 8 - FILTER_H8_W8_sse2 - %ifidn %3, pp - paddw m1, m2 - psraw m1, 6 - packuswb m1, m1 - movh [r2 + x], m1 - %else - psubw m1, m2 - movu [r2 + 2 * x], m1 - %endif -%assign x x+8 -%endrep - -%rep (%1 % 8) / 4 - FILTER_H8_W4_sse2 - %ifidn %3, pp - paddw m1, m2 - psraw m1, 6 - packuswb m1, m1 - movd [r2 + x], m1 - %else - psubw m1, m2 - 
movh [r2 + 2 * x], m1 - %endif -%endrep - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET - -%endmacro - -;-------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- - IPFILTER_LUMA_sse2 4, 4, pp - IPFILTER_LUMA_sse2 4, 8, pp - IPFILTER_LUMA_sse2 8, 4, pp - IPFILTER_LUMA_sse2 8, 8, pp - IPFILTER_LUMA_sse2 16, 16, pp - IPFILTER_LUMA_sse2 16, 8, pp - IPFILTER_LUMA_sse2 8, 16, pp - IPFILTER_LUMA_sse2 16, 12, pp - IPFILTER_LUMA_sse2 12, 16, pp - IPFILTER_LUMA_sse2 16, 4, pp - IPFILTER_LUMA_sse2 4, 16, pp - IPFILTER_LUMA_sse2 32, 32, pp - IPFILTER_LUMA_sse2 32, 16, pp - IPFILTER_LUMA_sse2 16, 32, pp - IPFILTER_LUMA_sse2 32, 24, pp - IPFILTER_LUMA_sse2 24, 32, pp - IPFILTER_LUMA_sse2 32, 8, pp - IPFILTER_LUMA_sse2 8, 32, pp - IPFILTER_LUMA_sse2 64, 64, pp - IPFILTER_LUMA_sse2 64, 32, pp - IPFILTER_LUMA_sse2 32, 64, pp - IPFILTER_LUMA_sse2 64, 48, pp - IPFILTER_LUMA_sse2 48, 64, pp - IPFILTER_LUMA_sse2 64, 16, pp - IPFILTER_LUMA_sse2 16, 64, pp - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- - IPFILTER_LUMA_sse2 4, 4, ps - IPFILTER_LUMA_sse2 8, 8, ps - IPFILTER_LUMA_sse2 8, 4, ps - IPFILTER_LUMA_sse2 4, 8, ps - IPFILTER_LUMA_sse2 16, 16, ps - IPFILTER_LUMA_sse2 16, 8, ps - IPFILTER_LUMA_sse2 8, 16, ps - IPFILTER_LUMA_sse2 16, 12, ps - IPFILTER_LUMA_sse2 12, 16, ps - IPFILTER_LUMA_sse2 16, 4, ps - IPFILTER_LUMA_sse2 4, 16, ps - IPFILTER_LUMA_sse2 32, 32, ps - 
IPFILTER_LUMA_sse2 32, 16, ps - IPFILTER_LUMA_sse2 16, 32, ps - IPFILTER_LUMA_sse2 32, 24, ps - IPFILTER_LUMA_sse2 24, 32, ps - IPFILTER_LUMA_sse2 32, 8, ps - IPFILTER_LUMA_sse2 8, 32, ps - IPFILTER_LUMA_sse2 64, 64, ps - IPFILTER_LUMA_sse2 64, 32, ps - IPFILTER_LUMA_sse2 32, 64, ps - IPFILTER_LUMA_sse2 64, 48, ps - IPFILTER_LUMA_sse2 48, 64, ps - IPFILTER_LUMA_sse2 64, 16, ps - IPFILTER_LUMA_sse2 16, 64, ps - -%macro FILTER_H4_w2_2_sse2 0 - pxor m3, m3 - movd m0, [srcq - 1] - movd m2, [srcq] - punpckldq m0, m2 - punpcklbw m0, m3 - movd m1, [srcq + srcstrideq - 1] - movd m2, [srcq + srcstrideq] - punpckldq m1, m2 - punpcklbw m1, m3 - pmaddwd m0, m4 - pmaddwd m1, m4 - packssdw m0, m1 - pshuflw m1, m0, q2301 - pshufhw m1, m1, q2301 - paddw m0, m1 - psrld m0, 16 - packssdw m0, m0 - paddw m0, m5 - psraw m0, 6 - packuswb m0, m0 - movd r4d, m0 - mov [dstq], r4w - shr r4, 16 - mov [dstq + dststrideq], r4w -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_2xN(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_H4_W2xN_sse3 1 -INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_2x%1, 4, 6, 6, src, srcstride, dst, dststride - mov r4d, r4m - mova m5, [pw_32] - -%ifdef PIC - lea r5, [h_tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] -%else - movddup m4, [h_tabw_ChromaCoeff + r4 * 8] -%endif - -%assign x 1 -%rep %1/2 - FILTER_H4_w2_2_sse2 -%if x < %1/2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] -%endif -%assign x x+1 -%endrep - - RET - -%endmacro - - FILTER_H4_W2xN_sse3 4 - FILTER_H4_W2xN_sse3 8 - FILTER_H4_W2xN_sse3 16 - -%macro FILTER_H4_w4_2_sse2 0 - pxor m5, m5 - movd m0, [srcq - 1] - movd m6, [srcq] - punpckldq m0, m6 - punpcklbw m0, m5 - movd m1, [srcq + 1] - movd m6, [srcq + 2] - punpckldq m1, m6 - punpcklbw m1, m5 - movd m2, [srcq + srcstrideq - 1] - movd m6, 
[srcq + srcstrideq]
-    punpckldq m2, m6
-    punpcklbw m2, m5
-    movd m3, [srcq + srcstrideq + 1]
-    movd m6, [srcq + srcstrideq + 2]
-    punpckldq m3, m6
-    punpcklbw m3, m5
-    pmaddwd m0, m4
-    pmaddwd m1, m4
-    pmaddwd m2, m4
-    pmaddwd m3, m4
-    packssdw m0, m1
-    packssdw m2, m3
-    pshuflw m1, m0, q2301
-    pshufhw m1, m1, q2301
-    pshuflw m3, m2, q2301
-    pshufhw m3, m3, q2301
-    paddw m0, m1
-    paddw m2, m3
-    psrld m0, 16
-    psrld m2, 16
-    packssdw m0, m2
-    paddw m0, m7
-    psraw m0, 6
-    packuswb m0, m2
-    movd [dstq], m0
-    psrldq m0, 4
-    movd [dstq + dststrideq], m0
-%endmacro
-
-;-----------------------------------------------------------------------------
-; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
-;-----------------------------------------------------------------------------
-%macro FILTER_H4_W4xN_sse3 1
-INIT_XMM sse3
-cglobal interp_4tap_horiz_pp_4x%1, 4, 6, 8, src, srcstride, dst, dststride
-    mov r4d, r4m
-    mova m7, [pw_32]
-
-%ifdef PIC
-    lea r5, [h_tabw_ChromaCoeff]
-    movddup m4, [r5 + r4 * 8]
-%else
-    movddup m4, [h_tabw_ChromaCoeff + r4 * 8]
-%endif
-
-%assign x 1
-%rep %1/2
-    FILTER_H4_w4_2_sse2
-%if x < %1/2
-    lea srcq, [srcq + srcstrideq * 2]
-    lea dstq, [dstq + dststrideq * 2]
-%endif
-%assign x x+1
-%endrep
-
-    RET
-
-%endmacro
-
-    FILTER_H4_W4xN_sse3 2
-    FILTER_H4_W4xN_sse3 4
-    FILTER_H4_W4xN_sse3 8
-    FILTER_H4_W4xN_sse3 16
-    FILTER_H4_W4xN_sse3 32
-
-%macro FILTER_H4_w6_sse2 0
-    pxor m4, m4
-    movh m0, [srcq - 1]
-    movh m5, [srcq]
-    punpckldq m0, m5
-    movhlps m2, m0
-    punpcklbw m0, m4
-    punpcklbw m2, m4
-    movd m1, [srcq + 1]
-    movd m5, [srcq + 2]
-    punpckldq m1, m5
-    punpcklbw m1, m4
-    pmaddwd m0, m6
-    pmaddwd m1, m6
-    pmaddwd m2, m6
-    packssdw m0, m1
-    packssdw m2, m2
-    pshuflw m1, m0, q2301
-    pshufhw m1, m1, q2301
-    pshuflw m3, m2, q2301
-    paddw m0, m1
-    paddw m2, m3
-    psrld m0, 16
-    psrld m2, 16
-    packssdw m0, m2
-    paddw m0, m7
-    psraw m0, 6
-    packuswb m0, m0
-    movd [dstq], m0
-    pextrw r4d, m0, 2
-    mov [dstq + 4], r4w
-%endmacro
-
-%macro FILH4W8_sse2 1
-    movh m0, [srcq - 1 + %1]
-    movh m5, [srcq + %1]
-    punpckldq m0, m5
-    movhlps m2, m0
-    punpcklbw m0, m4
-    punpcklbw m2, m4
-    movh m1, [srcq + 1 + %1]
-    movh m5, [srcq + 2 + %1]
-    punpckldq m1, m5
-    movhlps m3, m1
-    punpcklbw m1, m4
-    punpcklbw m3, m4
-    pmaddwd m0, m6
-    pmaddwd m1, m6
-    pmaddwd m2, m6
-    pmaddwd m3, m6
-    packssdw m0, m1
-    packssdw m2, m3
-    pshuflw m1, m0, q2301
-    pshufhw m1, m1, q2301
-    pshuflw m3, m2, q2301
-    pshufhw m3, m3, q2301
-    paddw m0, m1
-    paddw m2, m3
-    psrld m0, 16
-    psrld m2, 16
-    packssdw m0, m2
-    paddw m0, m7
-    psraw m0, 6
-    packuswb m0, m0
-    movh [dstq + %1], m0
-%endmacro
-
-%macro FILTER_H4_w8_sse2 0
-    FILH4W8_sse2 0
-%endmacro
-
-%macro FILTER_H4_w12_sse2 0
-    FILH4W8_sse2 0
-    movd m1, [srcq - 1 + 8]
-    movd m3, [srcq + 8]
-    punpckldq m1, m3
-    punpcklbw m1, m4
-    movd m2, [srcq + 1 + 8]
-    movd m3, [srcq + 2 + 8]
-    punpckldq m2, m3
-    punpcklbw m2, m4
-    pmaddwd m1, m6
-    pmaddwd m2, m6
-    packssdw m1, m2
-    pshuflw m2, m1, q2301
-    pshufhw m2, m2, q2301
-    paddw m1, m2
-    psrld m1, 16
-    packssdw m1, m1
-    paddw m1, m7
-    psraw m1, 6
-    packuswb m1, m1
-    movd [dstq + 8], m1
-%endmacro
-
-%macro FILTER_H4_w16_sse2 0
-    FILH4W8_sse2 0
-    FILH4W8_sse2 8
-%endmacro
-
-%macro FILTER_H4_w24_sse2 0
-    FILH4W8_sse2 0
-    FILH4W8_sse2 8
-    FILH4W8_sse2 16
-%endmacro
-
-%macro FILTER_H4_w32_sse2 0
-    FILH4W8_sse2 0
-    FILH4W8_sse2 8
-    FILH4W8_sse2 16
-    FILH4W8_sse2 24
-%endmacro
-
-%macro FILTER_H4_w48_sse2 0
-    FILH4W8_sse2 0
-    FILH4W8_sse2 8
-    FILH4W8_sse2 16
-    FILH4W8_sse2 24
-    FILH4W8_sse2 32
-    FILH4W8_sse2 40
-%endmacro
-
-%macro FILTER_H4_w64_sse2 0
-    FILH4W8_sse2 0
-    FILH4W8_sse2 8
-    FILH4W8_sse2 16
-    FILH4W8_sse2 24
-    FILH4W8_sse2 32
-    FILH4W8_sse2 40
-    FILH4W8_sse2 48
-    FILH4W8_sse2 56
-%endmacro
-
-;-----------------------------------------------------------------------------
-; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
-;----------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_sse3 2 -INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride - mov r4d, r4m - mova m7, [pw_32] - pxor m4, m4 - -%ifdef PIC - lea r5, [h_tabw_ChromaCoeff] - movddup m6, [r5 + r4 * 8] -%else - movddup m6, [h_tabw_ChromaCoeff + r4 * 8] -%endif - -%assign x 1 -%rep %2 - FILTER_H4_w%1_sse2 -%if x < %2 - add srcq, srcstrideq - add dstq, dststrideq -%endif -%assign x x+1 -%endrep - - RET - -%endmacro - - IPFILTER_CHROMA_sse3 6, 8 - IPFILTER_CHROMA_sse3 8, 2 - IPFILTER_CHROMA_sse3 8, 4 - IPFILTER_CHROMA_sse3 8, 6 - IPFILTER_CHROMA_sse3 8, 8 - IPFILTER_CHROMA_sse3 8, 16 - IPFILTER_CHROMA_sse3 8, 32 - IPFILTER_CHROMA_sse3 12, 16 - - IPFILTER_CHROMA_sse3 6, 16 - IPFILTER_CHROMA_sse3 8, 12 - IPFILTER_CHROMA_sse3 8, 64 - IPFILTER_CHROMA_sse3 12, 32 - - IPFILTER_CHROMA_sse3 16, 4 - IPFILTER_CHROMA_sse3 16, 8 - IPFILTER_CHROMA_sse3 16, 12 - IPFILTER_CHROMA_sse3 16, 16 - IPFILTER_CHROMA_sse3 16, 32 - IPFILTER_CHROMA_sse3 32, 8 - IPFILTER_CHROMA_sse3 32, 16 - IPFILTER_CHROMA_sse3 32, 24 - IPFILTER_CHROMA_sse3 24, 32 - IPFILTER_CHROMA_sse3 32, 32 - - IPFILTER_CHROMA_sse3 16, 24 - IPFILTER_CHROMA_sse3 16, 64 - IPFILTER_CHROMA_sse3 32, 48 - IPFILTER_CHROMA_sse3 24, 64 - IPFILTER_CHROMA_sse3 32, 64 - - IPFILTER_CHROMA_sse3 64, 64 - IPFILTER_CHROMA_sse3 64, 32 - IPFILTER_CHROMA_sse3 64, 48 - IPFILTER_CHROMA_sse3 48, 64 - IPFILTER_CHROMA_sse3 64, 16 - -%macro FILTER_2 2 - movd m3, [srcq + %1] - movd m4, [srcq + 1 + %1] - punpckldq m3, m4 - punpcklbw m3, m0 - pmaddwd m3, m1 - packssdw m3, m3 - pshuflw m4, m3, q2301 - paddw m3, m4 - psrldq m3, 2 - psubw m3, m2 - movd [dstq + %2], m3 -%endmacro - -%macro FILTER_4 2 - movd m3, [srcq + %1] - movd m4, [srcq + 1 + %1] - punpckldq m3, m4 - punpcklbw m3, m0 - pmaddwd m3, m1 - movd m4, [srcq + 2 + %1] - movd m5, [srcq + 3 + %1] - punpckldq m4, m5 - punpcklbw m4, m0 - pmaddwd m4, m1 - packssdw m3, m4 - 
pshuflw m4, m3, q2301 - pshufhw m4, m4, q2301 - paddw m3, m4 - psrldq m3, 2 - pshufd m3, m3, q3120 - psubw m3, m2 - movh [dstq + %2], m3 -%endmacro - -%macro FILTER_4TAP_HPS_sse3 2 -INIT_XMM sse3 -cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride - mov r4d, r4m - add dststrided, dststrided - mova m2, [pw_2000] - pxor m0, m0 - -%ifdef PIC - lea r6, [h_tabw_ChromaCoeff] - movddup m1, [r6 + r4 * 8] -%else - movddup m1, [h_tabw_ChromaCoeff + r4 * 8] -%endif - - mov r4d, %2 - cmp r5m, byte 0 - je .loopH - sub srcq, srcstrideq - add r4d, 3 - -.loopH: -%assign x -1 -%assign y 0 -%rep %1/4 - FILTER_4 x,y -%assign x x+4 -%assign y y+8 -%endrep -%rep (%1 % 4)/2 - FILTER_2 x,y -%endrep - add srcq, srcstrideq - add dstq, dststrideq - - dec r4d - jnz .loopH - RET - -%endmacro - - FILTER_4TAP_HPS_sse3 2, 4 - FILTER_4TAP_HPS_sse3 2, 8 - FILTER_4TAP_HPS_sse3 2, 16 - FILTER_4TAP_HPS_sse3 4, 2 - FILTER_4TAP_HPS_sse3 4, 4 - FILTER_4TAP_HPS_sse3 4, 8 - FILTER_4TAP_HPS_sse3 4, 16 - FILTER_4TAP_HPS_sse3 4, 32 - FILTER_4TAP_HPS_sse3 6, 8 - FILTER_4TAP_HPS_sse3 6, 16 - FILTER_4TAP_HPS_sse3 8, 2 - FILTER_4TAP_HPS_sse3 8, 4 - FILTER_4TAP_HPS_sse3 8, 6 - FILTER_4TAP_HPS_sse3 8, 8 - FILTER_4TAP_HPS_sse3 8, 12 - FILTER_4TAP_HPS_sse3 8, 16 - FILTER_4TAP_HPS_sse3 8, 32 - FILTER_4TAP_HPS_sse3 8, 64 - FILTER_4TAP_HPS_sse3 12, 16 - FILTER_4TAP_HPS_sse3 12, 32 - FILTER_4TAP_HPS_sse3 16, 4 - FILTER_4TAP_HPS_sse3 16, 8 - FILTER_4TAP_HPS_sse3 16, 12 - FILTER_4TAP_HPS_sse3 16, 16 - FILTER_4TAP_HPS_sse3 16, 24 - FILTER_4TAP_HPS_sse3 16, 32 - FILTER_4TAP_HPS_sse3 16, 64 - FILTER_4TAP_HPS_sse3 24, 32 - FILTER_4TAP_HPS_sse3 24, 64 - FILTER_4TAP_HPS_sse3 32, 8 - FILTER_4TAP_HPS_sse3 32, 16 - FILTER_4TAP_HPS_sse3 32, 24 - FILTER_4TAP_HPS_sse3 32, 32 - FILTER_4TAP_HPS_sse3 32, 48 - FILTER_4TAP_HPS_sse3 32, 64 - FILTER_4TAP_HPS_sse3 48, 64 - FILTER_4TAP_HPS_sse3 64, 16 - FILTER_4TAP_HPS_sse3 64, 32 - FILTER_4TAP_HPS_sse3 64, 48 - FILTER_4TAP_HPS_sse3 64, 64 - -%macro FILTER_H4_w2_2 3 - 
movh %2, [srcq - 1] - pshufb %2, %2, Tm0 - movh %1, [srcq + srcstrideq - 1] - pshufb %1, %1, Tm0 - punpcklqdq %2, %1 - pmaddubsw %2, coef2 - phaddw %2, %2 - pmulhrsw %2, %3 - packuswb %2, %2 - movd r4d, %2 - mov [dstq], r4w - shr r4, 16 - mov [dstq + dststrideq], r4w -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_2x4, 4, 6, 5, src, srcstride, dst, dststride -%define coef2 m4 -%define Tm0 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - -%rep 2 - FILTER_H4_w2_2 t0, t1, t2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] -%endrep - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_2x8, 4, 6, 5, src, srcstride, dst, dststride -%define coef2 m4 -%define Tm0 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - -%rep 4 - FILTER_H4_w2_2 t0, t1, t2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] -%endrep - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, 
pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_2x16, 4, 6, 5, src, srcstride, dst, dststride -%define coef2 m4 -%define Tm0 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - - mov r5d, 16/2 - -.loop: - FILTER_H4_w2_2 t0, t1, t2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] - dec r5d - jnz .loop - - RET - -%macro FILTER_H4_w4_2 3 - movh %2, [srcq - 1] - pshufb %2, %2, Tm0 - pmaddubsw %2, coef2 - movh %1, [srcq + srcstrideq - 1] - pshufb %1, %1, Tm0 - pmaddubsw %1, coef2 - phaddw %2, %1 - pmulhrsw %2, %3 - packuswb %2, %2 - movd [dstq], %2 - palignr %2, %2, 4 - movd [dstq + dststrideq], %2 -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_4x2, 4, 6, 5, src, srcstride, dst, dststride -%define coef2 m4 -%define Tm0 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - - FILTER_H4_w4_2 t0, t1, t2 - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_4x4, 4, 6, 
5, src, srcstride, dst, dststride -%define coef2 m4 -%define Tm0 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - -%rep 2 - FILTER_H4_w4_2 t0, t1, t2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] -%endrep - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_4x8, 4, 6, 5, src, srcstride, dst, dststride -%define coef2 m4 -%define Tm0 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - -%rep 4 - FILTER_H4_w4_2 t0, t1, t2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] -%endrep - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_4x16, 4, 6, 5, src, srcstride, dst, dststride -%define coef2 m4 -%define Tm0 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - -%rep 8 - FILTER_H4_w4_2 t0, t1, t2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + 
dststrideq * 2] -%endrep - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_4x32, 4, 6, 5, src, srcstride, dst, dststride -%define coef2 m4 -%define Tm0 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - - mov r5d, 32/2 - -.loop: - FILTER_H4_w4_2 t0, t1, t2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] - dec r5d - jnz .loop - - RET - -ALIGN 32 -const interp_4tap_8x8_horiz_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 - -%macro FILTER_H4_w6 3 - movu %1, [srcq - 1] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - pmulhrsw %2, %3 - packuswb %2, %2 - movd [dstq], %2 - pextrw [dstq + 4], %2, 2 -%endmacro - -%macro FILTER_H4_w8 3 - movu %1, [srcq - 1] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - pmulhrsw %2, %3 - packuswb %2, %2 - movh [dstq], %2 -%endmacro - -%macro FILTER_H4_w12 3 - movu %1, [srcq - 1] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - pmulhrsw %2, %3 - movu %1, [srcq - 1 + 8] - pshufb %1, %1, Tm0 - pmaddubsw %1, coef2 - phaddw %1, %1 - pmulhrsw %1, %3 - packuswb %2, %1 - movh [dstq], %2 - pextrd [dstq + 8], %2, 2 -%endmacro - -%macro FILTER_H4_w16 4 - movu %1, [srcq - 1] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq - 1 + 8] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, 
%1 - pmulhrsw %2, %3 - pmulhrsw %4, %3 - packuswb %2, %4 - movu [dstq], %2 -%endmacro - -%macro FILTER_H4_w24 4 - movu %1, [srcq - 1] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq - 1 + 8] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - pmulhrsw %2, %3 - pmulhrsw %4, %3 - packuswb %2, %4 - movu [dstq], %2 - movu %1, [srcq - 1 + 16] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - pmulhrsw %2, %3 - packuswb %2, %2 - movh [dstq + 16], %2 -%endmacro - -%macro FILTER_H4_w32 4 - movu %1, [srcq - 1] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq - 1 + 8] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - pmulhrsw %2, %3 - pmulhrsw %4, %3 - packuswb %2, %4 - movu [dstq], %2 - movu %1, [srcq - 1 + 16] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq - 1 + 24] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - pmulhrsw %2, %3 - pmulhrsw %4, %3 - packuswb %2, %4 - movu [dstq + 16], %2 -%endmacro - -%macro FILTER_H4_w16o 5 - movu %1, [srcq + %5 - 1] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq + %5 - 1 + 8] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - pmulhrsw %2, %3 - pmulhrsw %4, %3 - packuswb %2, %4 - movu [dstq + %5], %2 -%endmacro - -%macro FILTER_H4_w48 4 - FILTER_H4_w16o %1, %2, %3, %4, 0 - FILTER_H4_w16o %1, %2, %3, %4, 16 - FILTER_H4_w16o %1, %2, %3, %4, 32 -%endmacro - -%macro FILTER_H4_w64 4 - FILTER_H4_w16o %1, %2, %3, %4, 0 - FILTER_H4_w16o %1, %2, %3, %4, 16 - FILTER_H4_w16o %1, %2, %3, %4, 32 - 
FILTER_H4_w16o %1, %2, %3, %4, 48 -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro IPFILTER_CHROMA 2 -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 6, src, srcstride, dst, dststride -%define coef2 m5 -%define Tm0 m4 -%define Tm1 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mov r5d, %2 - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - mova Tm1, [h_tab_Tm + 16] - -.loop: - FILTER_H4_w%1 t0, t1, t2 - add srcq, srcstrideq - add dstq, dststrideq - - dec r5d - jnz .loop - - RET -%endmacro - - - IPFILTER_CHROMA 6, 8 - IPFILTER_CHROMA 8, 2 - IPFILTER_CHROMA 8, 4 - IPFILTER_CHROMA 8, 6 - IPFILTER_CHROMA 8, 8 - IPFILTER_CHROMA 8, 16 - IPFILTER_CHROMA 8, 32 - IPFILTER_CHROMA 12, 16 - - IPFILTER_CHROMA 6, 16 - IPFILTER_CHROMA 8, 12 - IPFILTER_CHROMA 8, 64 - IPFILTER_CHROMA 12, 32 - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_W 2 -INIT_XMM sse4 -cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 7, src, srcstride, dst, dststride -%define coef2 m6 -%define Tm0 m5 -%define Tm1 m4 -%define t3 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - movd coef2, [r5 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mov r5d, %2 - - pshufd coef2, coef2, 0 - mova t2, [pw_512] - mova Tm0, [h_tab_Tm] - mova Tm1, [h_tab_Tm + 16] - -.loop: - 
FILTER_H4_w%1 t0, t1, t2, t3 - add srcq, srcstrideq - add dstq, dststrideq - - dec r5d - jnz .loop - - RET -%endmacro - - IPFILTER_CHROMA_W 16, 4 - IPFILTER_CHROMA_W 16, 8 - IPFILTER_CHROMA_W 16, 12 - IPFILTER_CHROMA_W 16, 16 - IPFILTER_CHROMA_W 16, 32 - IPFILTER_CHROMA_W 32, 8 - IPFILTER_CHROMA_W 32, 16 - IPFILTER_CHROMA_W 32, 24 - IPFILTER_CHROMA_W 24, 32 - IPFILTER_CHROMA_W 32, 32 - - IPFILTER_CHROMA_W 16, 24 - IPFILTER_CHROMA_W 16, 64 - IPFILTER_CHROMA_W 32, 48 - IPFILTER_CHROMA_W 24, 64 - IPFILTER_CHROMA_W 32, 64 - - IPFILTER_CHROMA_W 64, 64 - IPFILTER_CHROMA_W 64, 32 - IPFILTER_CHROMA_W 64, 48 - IPFILTER_CHROMA_W 48, 64 - IPFILTER_CHROMA_W 64, 16 - -%macro FILTER_H8_W8 7-8 ; t0, t1, t2, t3, coef, c512, src, dst - movu %1, %7 - pshufb %2, %1, [h_tab_Lm + 0] - pmaddubsw %2, %5 - pshufb %3, %1, [h_tab_Lm + 16] - pmaddubsw %3, %5 - phaddw %2, %3 - pshufb %4, %1, [h_tab_Lm + 32] - pmaddubsw %4, %5 - pshufb %1, %1, [h_tab_Lm + 48] - pmaddubsw %1, %5 - phaddw %4, %1 - phaddw %2, %4 - %if %0 == 8 - pmulhrsw %2, %6 - packuswb %2, %2 - movh %8, %2 - %endif -%endmacro - -%macro FILTER_H8_W4 2 - movu %1, [r0 - 3 + r5] - pshufb %2, %1, [h_tab_Lm] - pmaddubsw %2, m3 - pshufb m7, %1, [h_tab_Lm + 16] - pmaddubsw m7, m3 - phaddw %2, m7 - phaddw %2, %2 -%endmacro - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_LUMA 3 -INIT_XMM sse4 -cglobal interp_8tap_horiz_%3_%1x%2, 4,7,8 - - mov r4d, r4m - -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - movh m3, [r6 + r4 * 8] -%else - movh m3, [h_tab_LumaCoeff + r4 * 8] -%endif - punpcklqdq m3, m3 - -%ifidn %3, pp - mova m2, [pw_512] -%else - mova m2, [pw_2000] -%endif - - mov r4d, %2 -%ifidn %3, ps - 
add r3, r3 - cmp r5m, byte 0 - je .loopH - lea r6, [r1 + 2 * r1] - sub r0, r6 - add r4d, 7 -%endif - -.loopH: - xor r5, r5 -%rep %1 / 8 - %ifidn %3, pp - FILTER_H8_W8 m0, m1, m4, m5, m3, m2, [r0 - 3 + r5], [r2 + r5] - %else - FILTER_H8_W8 m0, m1, m4, m5, m3, UNUSED, [r0 - 3 + r5] - psubw m1, m2 - movu [r2 + 2 * r5], m1 - %endif - add r5, 8 -%endrep - -%rep (%1 % 8) / 4 - FILTER_H8_W4 m0, m1 - %ifidn %3, pp - pmulhrsw m1, m2 - packuswb m1, m1 - movd [r2 + r5], m1 - %else - psubw m1, m2 - movh [r2 + 2 * r5], m1 - %endif -%endrep - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET -%endmacro - -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_4x4, 4,6,6 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - - mova m1, [h_tab_Lm] - vpbroadcastd m2, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - sub r0, 3 - ; Row 0-1 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] - - ; Row 2-3 - lea r0, [r0 + r1 * 2] - vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - phaddd m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] - - packssdw m3, m4 ; WORD [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A] - pmulhrsw m3, [pw_512] - vextracti128 xm4, m3, 1 - packuswb xm3, xm4 ; BYTE [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A] - pshufb xm3, [interp4_shuf] ; [row3 row1 row2 row0] - - lea r0, [r3 * 3] - movd [r2], xm3 - pextrd [r2+r3], xm3, 2 - pextrd [r2+r3*2], xm3, 1 - pextrd 
[r2+r0], xm3, 3 - RET - -%macro FILTER_HORIZ_LUMA_AVX2_4xN 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_pp_4x%1, 4, 6, 9 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - - mova m1, [h_tab_Lm] - mova m2, [pw_1] - mova m7, [h_interp8_hps_shuf] - mova m8, [pw_512] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - lea r4, [r1 * 3] - lea r5, [r3 * 3] - sub r0, 3 -%rep %1 / 8 - ; Row 0-1 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] - - ; Row 2-3 - vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - phaddd m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] - - packssdw m3, m4 ; WORD [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A] - lea r0, [r0 + r1 * 4] - ; Row 4-5 - vbroadcasti128 m5, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - phaddd m5, m4 ; DWORD [R5D R5C R4D R4C R5B R5A R4B R4A] - - ; Row 6-7 - vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m6, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m6, m1 - pmaddubsw m6, m0 - pmaddwd m6, m2 - phaddd m4, m6 ; DWORD [R7D R7C R6D R6C R7B R7A R6B R6A] - - packssdw m5, m4 ; WORD [R7D R7C R6D R6C R5D R5C R4D R4C R7B R7A R6B R6A R5B R5A R4B R4A] - 
vpermd m3, m7, m3 - vpermd m5, m7, m5 - pmulhrsw m3, m8 - pmulhrsw m5, m8 - packuswb m3, m5 - vextracti128 xm5, m3, 1 - - movd [r2], xm3 - pextrd [r2 + r3], xm3, 1 - movd [r2 + r3 * 2], xm5 - pextrd [r2 + r5], xm5, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm3, 2 - pextrd [r2 + r3], xm3, 3 - pextrd [r2 + r3 * 2], xm5, 2 - pextrd [r2 + r5], xm5, 3 - lea r0, [r0 + r1 * 4] - lea r2, [r2 + r3 * 4] -%endrep - RET -%endif -%endmacro - - FILTER_HORIZ_LUMA_AVX2_4xN 8 - FILTER_HORIZ_LUMA_AVX2_4xN 16 - -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_8x4, 4, 6, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - - mova m1, [h_tab_Lm] - mova m2, [h_tab_Lm + 32] - - ; register map - ; m0 - interpolate coeff - ; m1, m2 - shuffle order table - - sub r0, 3 - lea r5, [r1 * 3] - lea r4, [r3 * 3] - - ; Row 0 - vbroadcasti128 m3, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, m2 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddubsw m4, m0 - phaddw m3, m4 - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m2 - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddubsw m5, m0 - phaddw m4, m5 - - phaddw m3, m4 ; WORD [R1H R1G R1D R1C R0H R0G R0D R0C R1F R1E R1B R1A R0F R0E R0B R0A] - pmulhrsw m3, [pw_512] - - ; Row 2 - vbroadcasti128 m4, [r0 + r1 * 2] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m2 - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddubsw m5, m0 - phaddw m4, m5 - ; Row 3 - vbroadcasti128 m5, [r0 + r5] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m6, m5, m2 - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddubsw m6, m0 - phaddw m5, m6 - - phaddw m4, m5 ; WORD [R3H R3G R3D R3C R2H R2G R2D R2C R3F R3E R3B R3A R2F R2E R2B R2A] - pmulhrsw m4, [pw_512] - - packuswb m3, m4 - vextracti128 xm4, m3, 1 - punpcklwd xm5, xm3, xm4 - - movq [r2], xm5 - movhps [r2 + r3], xm5 - - punpckhwd xm5, xm3, xm4 - movq [r2 + r3 * 2], xm5 - movhps [r2 + r4], xm5 - RET - -%macro 
IPFILTER_LUMA_AVX2_8xN 2 -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_%1x%2, 4, 7, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - - mova m1, [h_tab_Lm] - mova m2, [h_tab_Lm + 32] - - ; register map - ; m0 - interpolate coeff - ; m1, m2 - shuffle order table - - sub r0, 3 - lea r5, [r1 * 3] - lea r6, [r3 * 3] - mov r4d, %2 / 4 -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, m2 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddubsw m4, m0 - phaddw m3, m4 - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m2 - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddubsw m5, m0 - phaddw m4, m5 - - phaddw m3, m4 ; WORD [R1H R1G R1D R1C R0H R0G R0D R0C R1F R1E R1B R1A R0F R0E R0B R0A] - pmulhrsw m3, [pw_512] - - ; Row 2 - vbroadcasti128 m4, [r0 + r1 * 2] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m2 - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddubsw m5, m0 - phaddw m4, m5 - ; Row 3 - vbroadcasti128 m5, [r0 + r5] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m6, m5, m2 - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddubsw m6, m0 - phaddw m5, m6 - - phaddw m4, m5 ; WORD [R3H R3G R3D R3C R2H R2G R2D R2C R3F R3E R3B R3A R2F R2E R2B R2A] - pmulhrsw m4, [pw_512] - - packuswb m3, m4 - vextracti128 xm4, m3, 1 - punpcklwd xm5, xm3, xm4 - - movq [r2], xm5 - movhps [r2 + r3], xm5 - - punpckhwd xm5, xm3, xm4 - movq [r2 + r3 * 2], xm5 - movhps [r2 + r6], xm5 - - lea r0, [r0 + r1 * 4] - lea r2, [r2 + r3 * 4] - dec r4d - jnz .loop - RET -%endmacro - - IPFILTER_LUMA_AVX2_8xN 8, 8 - IPFILTER_LUMA_AVX2_8xN 8, 16 - IPFILTER_LUMA_AVX2_8xN 8, 32 - -%macro IPFILTER_LUMA_AVX2 2 -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 - sub r0, 3 - mov r4d, r4m -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastd m0, [r5 + r4 * 8] - vpbroadcastd m1, [r5 + r4 * 8 + 4] -%else - vpbroadcastd m0, [h_tab_LumaCoeff + r4 * 8] - 
vpbroadcastd m1, [h_tab_LumaCoeff + r4 * 8 + 4] -%endif - movu m3, [h_tab_Tm + 16] - vpbroadcastd m7, [pw_1] - - ; register map - ; m0 , m1 interpolate coeff - ; m2 , m2 shuffle order table - ; m7 - pw_1 - - mov r4d, %2/2 -.loop: - ; Row 0 - vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m3 - pshufb m4, [h_tab_Tm] - pmaddubsw m4, m0 - pmaddubsw m5, m1 - paddw m4, m5 - pmaddwd m4, m7 - vbroadcasti128 m5, [r0 + 8] ; second 8 elements in Row0 - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] - pmulhrsw m4, [pw_512] - vbroadcasti128 m2, [r0 + r1] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m2, m3 - pshufb m2, [h_tab_Tm] - pmaddubsw m2, m0 - pmaddubsw m5, m1 - paddw m2, m5 - pmaddwd m2, m7 - vbroadcasti128 m5, [r0 + r1 + 8] ; second 8 elements in Row0 - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m2, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] - pmulhrsw m2, [pw_512] - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm5, m4, 1 - pshufd xm4, xm4, 11011000b - pshufd xm5, xm5, 11011000b - movu [r2], xm4 - movu [r2+r3], xm5 - lea r0, [r0 + r1 * 2] - lea r2, [r2 + r3 * 2] - dec r4d - jnz .loop - RET -%endmacro - -%macro IPFILTER_LUMA_32x_avx2 2 -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 - sub r0, 3 - mov r4d, r4m -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastd m0, [r5 + r4 * 8] - vpbroadcastd m1, [r5 + r4 * 8 + 4] -%else - vpbroadcastd m0, [h_tab_LumaCoeff + r4 * 8] - vpbroadcastd m1, [h_tab_LumaCoeff + r4 * 8 + 4] -%endif - movu m3, [h_tab_Tm + 16] - vpbroadcastd m7, [pw_1] - - ; register map - ; m0 , m1 interpolate coeff - ; m2 , m2 shuffle order table - ; m7 - pw_1 - - mov r4d, %2 -.loop: - ; Row 0 - vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m3 - 
pshufb m4, [h_tab_Tm] - pmaddubsw m4, m0 - pmaddubsw m5, m1 - paddw m4, m5 - pmaddwd m4, m7 - vbroadcasti128 m5, [r0 + 8] - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] - pmulhrsw m4, [pw_512] - vbroadcasti128 m2, [r0 + 16] - pshufb m5, m2, m3 - pshufb m2, [h_tab_Tm] - pmaddubsw m2, m0 - pmaddubsw m5, m1 - paddw m2, m5 - pmaddwd m2, m7 - vbroadcasti128 m5, [r0 + 24] - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m2, m5 - pmulhrsw m2, [pw_512] - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm5, m4, 1 - pshufd xm4, xm4, 11011000b - pshufd xm5, xm5, 11011000b - movu [r2], xm4 - movu [r2 + 16], xm5 - lea r0, [r0 + r1] - lea r2, [r2 + r3] - dec r4d - jnz .loop - RET -%endmacro -%macro IPFILTER_LUMA_64x_avx2 2 -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 - sub r0, 3 - mov r4d, r4m -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastd m0, [r5 + r4 * 8] - vpbroadcastd m1, [r5 + r4 * 8 + 4] -%else - vpbroadcastd m0, [h_tab_LumaCoeff + r4 * 8] - vpbroadcastd m1, [h_tab_LumaCoeff + r4 * 8 + 4] -%endif - movu m3, [h_tab_Tm + 16] - vpbroadcastd m7, [pw_1] - - ; register map - ; m0 , m1 interpolate coeff - ; m2 , m2 shuffle order table - ; m7 - pw_1 - - mov r4d, %2 -.loop: - ; Row 0 - vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m3 - pshufb m4, [h_tab_Tm] - pmaddubsw m4, m0 - pmaddubsw m5, m1 - paddw m4, m5 - pmaddwd m4, m7 - vbroadcasti128 m5, [r0 + 8] - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] - pmulhrsw m4, [pw_512] - vbroadcasti128 m2, [r0 + 16] - pshufb m5, m2, m3 - pshufb m2, [h_tab_Tm] - pmaddubsw m2, m0 - pmaddubsw m5, m1 - paddw m2, m5 - pmaddwd m2, m7 - 
vbroadcasti128 m5, [r0 + 24] - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m2, m5 - pmulhrsw m2, [pw_512] - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm5, m4, 1 - pshufd xm4, xm4, 11011000b - pshufd xm5, xm5, 11011000b - movu [r2], xm4 - movu [r2 + 16], xm5 - - vbroadcasti128 m4, [r0 + 32] - pshufb m5, m4, m3 - pshufb m4, [h_tab_Tm] - pmaddubsw m4, m0 - pmaddubsw m5, m1 - paddw m4, m5 - pmaddwd m4, m7 - vbroadcasti128 m5, [r0 + 40] - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m4, m5 - pmulhrsw m4, [pw_512] - vbroadcasti128 m2, [r0 + 48] - pshufb m5, m2, m3 - pshufb m2, [h_tab_Tm] - pmaddubsw m2, m0 - pmaddubsw m5, m1 - paddw m2, m5 - pmaddwd m2, m7 - vbroadcasti128 m5, [r0 + 56] - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m2, m5 - pmulhrsw m2, [pw_512] - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm5, m4, 1 - pshufd xm4, xm4, 11011000b - pshufd xm5, xm5, 11011000b - movu [r2 +32], xm4 - movu [r2 + 48], xm5 - - lea r0, [r0 + r1] - lea r2, [r2 + r3] - dec r4d - jnz .loop - RET -%endmacro - -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_48x64, 4,6,8 - sub r0, 3 - mov r4d, r4m -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastd m0, [r5 + r4 * 8] - vpbroadcastd m1, [r5 + r4 * 8 + 4] -%else - vpbroadcastd m0, [h_tab_LumaCoeff + r4 * 8] - vpbroadcastd m1, [h_tab_LumaCoeff + r4 * 8 + 4] -%endif - movu m3, [h_tab_Tm + 16] - vpbroadcastd m7, [pw_1] - - ; register map - ; m0 , m1 interpolate coeff - ; m2 , m2 shuffle order table - ; m7 - pw_1 - - mov r4d, 64 -.loop: - ; Row 0 - vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m3 - pshufb m4, [h_tab_Tm] - pmaddubsw m4, m0 - pmaddubsw m5, m1 - paddw m4, m5 - pmaddwd m4, m7 - vbroadcasti128 m5, [r0 + 8] - pshufb m6, m5, m3 - pshufb 
m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] - pmulhrsw m4, [pw_512] - - vbroadcasti128 m2, [r0 + 16] - pshufb m5, m2, m3 - pshufb m2, [h_tab_Tm] - pmaddubsw m2, m0 - pmaddubsw m5, m1 - paddw m2, m5 - pmaddwd m2, m7 - vbroadcasti128 m5, [r0 + 24] - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m2, m5 - pmulhrsw m2, [pw_512] - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm5, m4, 1 - pshufd xm4, xm4, 11011000b - pshufd xm5, xm5, 11011000b - movu [r2], xm4 - movu [r2 + 16], xm5 - - vbroadcasti128 m4, [r0 + 32] - pshufb m5, m4, m3 - pshufb m4, [h_tab_Tm] - pmaddubsw m4, m0 - pmaddubsw m5, m1 - paddw m4, m5 - pmaddwd m4, m7 - vbroadcasti128 m5, [r0 + 40] - pshufb m6, m5, m3 - pshufb m5, [h_tab_Tm] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m4, m5 - pmulhrsw m4, [pw_512] - packuswb m4, m4 - vpermq m4, m4, 11011000b - pshufd xm4, xm4, 11011000b - movu [r2 + 32], xm4 - - lea r0, [r0 + r1] - lea r2, [r2 + r3] - dec r4d - jnz .loop - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_4x4, 4,6,6 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vpbroadcastd m2, [pw_1] - vbroadcasti128 m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - - ; Row 0-1 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - vinserti128 m3, m3, [r0 + r1], 1 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 2-3 - lea r0, [r0 + r1 * 2] - vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - vinserti128 m4, m4, [r0 + r1], 1 - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - pmulhrsw m3, [pw_512] - vextracti128 xm4, m3, 1 - packuswb 
xm3, xm4 - - lea r0, [r3 * 3] - movd [r2], xm3 - pextrd [r2+r3], xm3, 2 - pextrd [r2+r3*2], xm3, 1 - pextrd [r2+r0], xm3, 3 - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_2x4, 4, 6, 3 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - dec r0 - lea r4, [r1 * 3] - movq xm1, [r0] - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m1, m1, xm2, 1 - pshufb m1, [interp4_hpp_shuf] - pmaddubsw m1, m0 - pmaddwd m1, [pw_1] - vextracti128 xm2, m1, 1 - packssdw xm1, xm2 - pmulhrsw xm1, [pw_512] - packuswb xm1, xm1 - - lea r4, [r3 * 3] - pextrw [r2], xm1, 0 - pextrw [r2 + r3], xm1, 1 - pextrw [r2 + r3 * 2], xm1, 2 - pextrw [r2 + r4], xm1, 3 - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m4, [interp4_hpp_shuf] - mova m5, [pw_1] - dec r0 - lea r4, [r1 * 3] - movq xm1, [r0] - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m1, m1, xm2, 1 - lea r0, [r0 + r1 * 4] - movq xm3, [r0] - movhps xm3, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m3, m3, xm2, 1 - - pshufb m1, m4 - pshufb m3, m4 - pmaddubsw m1, m0 - pmaddubsw m3, m0 - pmaddwd m1, m5 - pmaddwd m3, m5 - packssdw m1, m3 - pmulhrsw m1, [pw_512] - vextracti128 xm2, m1, 1 - packuswb xm1, xm2 - - lea r4, [r3 * 3] - pextrw [r2], xm1, 0 - pextrw [r2 + r3], xm1, 1 - pextrw [r2 + r3 * 2], xm1, 4 - pextrw [r2 + r4], xm1, 5 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm1, 2 - pextrw [r2 + r3], xm1, 3 - pextrw [r2 + r3 * 2], xm1, 6 - pextrw [r2 + r4], xm1, 7 - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_32x32, 4,6,7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, 
[h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - mova m6, [pw_512] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, 32 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - vbroadcasti128 m4, [r0 + 16] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + 20] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, 11011000b - - movu [r2], m3 - lea r2, [r2 + r3] - lea r0, [r0 + r1] - dec r4d - jnz .loop - RET - - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_16x16, 4, 6, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m6, [pw_512] - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, 8 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, 11011000b - - vextracti128 xm4, m3, 1 - movu [r2], xm3 - movu [r2 + r3], xm4 - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - dec 
r4d - jnz .loop - RET - -;-------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- - IPFILTER_LUMA 4, 4, pp - IPFILTER_LUMA 4, 8, pp - IPFILTER_LUMA 12, 16, pp - IPFILTER_LUMA 4, 16, pp - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_8x8, 4,6,6 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - movu m1, [h_tab_Tm] - vpbroadcastd m2, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - sub r0, 1 - mov r4d, 2 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, [pw_512] - lea r0, [r0 + r1 * 2] - - ; Row 2 - vbroadcasti128 m4, [r0 ] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - ; Row 3 - vbroadcasti128 m5, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, [pw_512] - - packuswb m3, m4 - mova m5, [interp_4tap_8x8_horiz_shuf] - vpermd m3, m5, m3 - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movhps [r2 + r3], xm3 - lea r2, [r2 + r3 * 2] - movq [r2], xm4 - movhps [r2 + r3], xm4 - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1*2] - dec r4d - jnz .loop - RET - - IPFILTER_LUMA_AVX2 16, 4 - IPFILTER_LUMA_AVX2 16, 8 - IPFILTER_LUMA_AVX2 16, 12 - IPFILTER_LUMA_AVX2 16, 16 - IPFILTER_LUMA_AVX2 16, 32 - IPFILTER_LUMA_AVX2 16, 64 - - IPFILTER_LUMA_32x_avx2 32 , 8 - IPFILTER_LUMA_32x_avx2 
32 , 16 - IPFILTER_LUMA_32x_avx2 32 , 24 - IPFILTER_LUMA_32x_avx2 32 , 32 - IPFILTER_LUMA_32x_avx2 32 , 64 - - IPFILTER_LUMA_64x_avx2 64 , 64 - IPFILTER_LUMA_64x_avx2 64 , 48 - IPFILTER_LUMA_64x_avx2 64 , 32 - IPFILTER_LUMA_64x_avx2 64 , 16 - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_8x2, 4, 6, 5 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [h_tab_Tm] - mova m2, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, [pw_512] - vextracti128 xm4, m3, 1 - packuswb xm3, xm4 - pshufd xm3, xm3, 11011000b - movq [r2], xm3 - movhps [r2 + r3], xm3 - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_8x6, 4, 6, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [h_tab_Tm] - mova m2, [pw_1] - mova m6, [pw_512] - lea r4, [r1 * 3] - lea r5, [r3 * 3] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - ; Row 2 - vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - ; Row 3 - vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - 
pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - mova m5, [h_interp8_hps_shuf] - vpermd m3, m5, m3 - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movhps [r2 + r3], xm3 - movq [r2 + r3 * 2], xm4 - movhps [r2 + r5], xm4 - lea r2, [r2 + r3 * 4] - lea r0, [r0 + r1 * 4] - ; Row 4 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 5 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - vextracti128 xm4, m3, 1 - packuswb xm3, xm4 - pshufd xm3, xm3, 11011000b - movq [r2], xm3 - movhps [r2 + r3], xm3 - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_6x8, 4, 6, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [h_tab_Tm] - mova m2, [pw_1] - mova m6, [pw_512] - lea r4, [r1 * 3] - lea r5, [r3 * 3] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 -%rep 2 - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - ; Row 2 - vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - ; Row 3 - vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vextracti128 xm4, m3, 1 - movd [r2], xm3 - pextrw [r2 + 4], xm4, 0 - pextrd [r2 + r3], xm3, 1 - pextrw [r2 + r3 + 4], xm4, 2 - pextrd [r2 + r3 * 2], xm3, 2 - pextrw [r2 + r3 * 2 + 4], xm4, 4 - pextrd [r2 + r5], xm3, 3 - pextrw [r2 + r5 + 4], xm4, 6 - 
lea r2, [r2 + r3 * 4] - lea r0, [r0 + r1 * 4] -%endrep - RET - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;-----------------------------------------------------------------------------------------------------------------------------; -%macro IPFILTER_CHROMA_HPS_64xN 1 -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_64x%1, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - mov r6d, %1 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 3 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2], m3 - - vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2 + 32], m3 - - vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 40] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2 + 64], m3 - - vbroadcasti128 m3, [r0 + 48] ; 
[x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 56] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2 + 96], m3 - - add r2, r3 - add r0, r1 - dec r6d - jnz .loop - RET -%endmacro - - IPFILTER_CHROMA_HPS_64xN 64 - IPFILTER_CHROMA_HPS_64xN 32 - IPFILTER_CHROMA_HPS_64xN 48 - IPFILTER_CHROMA_HPS_64xN 16 - -;----------------------------------------------------------------------------------------------------------------------------- -;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- - -%macro IPFILTER_LUMA_PS_4xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_4x%1, 6,7,6 - mov r5d, r5m - mov r4d, r4m -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - mova m1, [h_tab_Lm] - add r3d, r3d - vbroadcasti128 m2, [pw_2000] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - pw_2000 - - sub r0, 3 - test r5d, r5d - mov r5d, %1 ; loop count variable - height - jz .preloop - lea r6, [r1 * 3] ; r8 = (N / 2 - 1) * srcStride - sub r0, r6 ; r0(src) - 3 * srcStride - add r5d, 7 ; need extra 7 rows, just set a specially flag here, blkheight += N - 1 (7 - 3 = 4 ; since the last three rows not in loop) - -.preloop: - lea r6, [r3 * 3] -.loop: - ; Row 0-1 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 ; shuffled based on the col order tab_Lm - pmaddubsw m3, m0 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - phaddw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] - - ; Row 2-3 - lea r0, [r0 + r1 
* 2] ;3rd row(i.e 2nd row) - vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - vbroadcasti128 m5, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - phaddw m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] - phaddw m3, m4 ; all rows and col completed. - - mova m5, [h_interp8_hps_shuf] - vpermd m3, m5, m3 - psubw m3, m2 - - vextracti128 xm4, m3, 1 - movq [r2], xm3 ;row 0 - movhps [r2 + r3], xm3 ;row 1 - movq [r2 + r3 * 2], xm4 ;row 2 - movhps [r2 + r6], xm4 ;row 3 - - lea r0, [r0 + r1 * 2] ; first loop src ->5th row(i.e 4) - lea r2, [r2 + r3 * 4] ; first loop dst ->5th row(i.e 4) - sub r5d, 4 - jz .end - cmp r5d, 4 - jge .loop - - ; Row 8-9 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - phaddw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] - - ; Row 10 - vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - phaddw m4, m4 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] - phaddw m3, m4 - - vpermd m3, m5, m3 ; m5 don't broken in above - psubw m3, m2 - - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movhps [r2 + r3], xm3 - movq [r2 + r3 * 2], xm4 -.end: - RET -%endif -%endmacro - - IPFILTER_LUMA_PS_4xN_AVX2 4 - IPFILTER_LUMA_PS_4xN_AVX2 8 - IPFILTER_LUMA_PS_4xN_AVX2 16 - -%macro IPFILTER_LUMA_PS_8xN_AVX2 1 -; TODO: verify and enable on X86 mode -%if ARCH_X86_64 == 1 -; void filter_hps(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) -INIT_YMM avx2 -cglobal interp_8tap_horiz_ps_8x%1, 4,7,6 - mov r5d, r5m - mov r4d, r4m - shl r4d, 7 -%ifdef PIC - lea r6, [pb_LumaCoeffVer] - add r6, r4 -%else - lea r6, [pb_LumaCoeffVer + r4] -%endif - add r3d, r3d - vpbroadcastd m0, [pw_2000] - sub r0, 3 - lea r4, [pb_8tap_hps_0] - vbroadcasti128 m5, [r4 + 0 * mmsize] 
- - ; check row count extend for interpolateHV - test r5d, r5d; - mov r5d, %1 - jz .enter_loop - lea r4, [r1 * 3] ; r8 = (N / 2 - 1) * srcStride - sub r0, r4 ; r0(src)-r8 - add r5d, 8-1-2 ; blkheight += N - 1 (7 - 3 = 4 ; since the last three rows not in loop) - -.enter_loop: - lea r4, [pb_8tap_hps_0] - - ; ***** register map ***** - ; m0 - pw_2000 - ; r4 - base pointer of shuffle order table - ; r5 - count of loop - ; r6 - point to LumaCoeff -.loop: - - ; Row 0-1 - movu xm1, [r0] - movu xm2, [r0 + r1] - vinserti128 m1, m1, xm2, 1 - pshufb m2, m1, m5 ; [0 1 1 2 2 3 3 4 ...] - pshufb m3, m1, [r4 + 1 * mmsize] ; [2 3 3 4 4 5 5 6 ...] - pshufb m4, m1, [r4 + 2 * mmsize] ; [4 5 5 6 6 7 7 8 ...] - pshufb m1, m1, [r4 + 3 * mmsize] ; [6 7 7 8 8 9 9 A ...] - pmaddubsw m2, [r6 + 0 * mmsize] - pmaddubsw m3, [r6 + 1 * mmsize] - pmaddubsw m4, [r6 + 2 * mmsize] - pmaddubsw m1, [r6 + 3 * mmsize] - paddw m2, m3 - paddw m1, m4 - paddw m1, m2 - psubw m1, m0 - - vextracti128 xm2, m1, 1 - movu [r2], xm1 ; row 0 - movu [r2 + r3], xm2 ; row 1 - - lea r0, [r0 + r1 * 2] ; first loop src ->5th row(i.e 4) - lea r2, [r2 + r3 * 2] ; first loop dst ->5th row(i.e 4) - sub r5d, 2 - jg .loop - jz .end - - ; last row - movu xm1, [r0] - pshufb xm2, xm1, xm5 ; [0 1 1 2 2 3 3 4 ...] - pshufb xm3, xm1, [r4 + 1 * mmsize] ; [2 3 3 4 4 5 5 6 ...] - pshufb xm4, xm1, [r4 + 2 * mmsize] ; [4 5 5 6 6 7 7 8 ...] - pshufb xm1, xm1, [r4 + 3 * mmsize] ; [6 7 7 8 8 9 9 A ...] 
- pmaddubsw xm2, [r6 + 0 * mmsize] - pmaddubsw xm3, [r6 + 1 * mmsize] - pmaddubsw xm4, [r6 + 2 * mmsize] - pmaddubsw xm1, [r6 + 3 * mmsize] - paddw xm2, xm3 - paddw xm1, xm4 - paddw xm1, xm2 - psubw xm1, xm0 - movu [r2], xm1 ;row 0 -.end: - RET -%endif -%endmacro ; IPFILTER_LUMA_PS_8xN_AVX2 - - IPFILTER_LUMA_PS_8xN_AVX2 4 - IPFILTER_LUMA_PS_8xN_AVX2 8 - IPFILTER_LUMA_PS_8xN_AVX2 16 - IPFILTER_LUMA_PS_8xN_AVX2 32 - - -%macro IPFILTER_LUMA_PS_16x_AVX2 2 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_%1x%2, 6, 10, 7 - mov r5d, r5m - mov r4d, r4m -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - mova m6, [h_tab_Lm + 32] - mova m1, [h_tab_Lm] - mov r9, %2 ;height - add r3d, r3d - vbroadcasti128 m2, [pw_2000] - - ; register map - ; m0 - interpolate coeff - ; m1 , m6 - shuffle order table - ; m2 - pw_2000 - - xor r7, r7 ; loop count variable - sub r0, 3 - test r5d, r5d - jz .label - lea r8, [r1 * 3] ; r8 = (N / 2 - 1) * srcStride - sub r0, r8 ; r0(src)-r8 - add r9, 7 ; blkheight += N - 1 (7 - 1 = 6 ; since the last one row not in loop) - -.label: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, m6 ; row 0 (col 4 to 7) - pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - phaddw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] - - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m6 ;row 1 (col 4 to 7) - pshufb m4, m1 ;row 1 (col 0 to 3) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - phaddw m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] - phaddw m3, m4 ; all rows and col completed. 
- - mova m5, [h_interp8_hps_shuf] - vpermd m3, m5, m3 - psubw m3, m2 - - movu [r2], m3 ;row 0 - - lea r0, [r0 + r1] ; first loop src ->5th row(i.e 4) - lea r2, [r2 + r3] ; first loop dst ->5th row(i.e 4) - dec r9d - jnz .label - - RET -%endif -%endmacro - - - IPFILTER_LUMA_PS_16x_AVX2 16 , 16 - IPFILTER_LUMA_PS_16x_AVX2 16 , 8 - IPFILTER_LUMA_PS_16x_AVX2 16 , 12 - IPFILTER_LUMA_PS_16x_AVX2 16 , 4 - IPFILTER_LUMA_PS_16x_AVX2 16 , 32 - IPFILTER_LUMA_PS_16x_AVX2 16 , 64 - - -;-------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_LUMA_PP_W8 2 -INIT_XMM sse4 -cglobal interp_8tap_horiz_pp_%1x%2, 4,6,7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - movh m3, [r5 + r4 * 8] -%else - movh m3, [h_tab_LumaCoeff + r4 * 8] -%endif - pshufd m0, m3, 0 ; m0 = coeff-L - pshufd m1, m3, 0x55 ; m1 = coeff-H - lea r5, [h_tab_Tm] ; r5 = shuffle - mova m2, [pw_512] ; m2 = 512 - - mov r4d, %2 -.loopH: -%assign x 0 -%rep %1 / 8 - movu m3, [r0 - 3 + x] ; m3 = [F E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, [r5 + 0*16] ; m4 = [6 5 4 3 5 4 3 2 4 3 2 1 3 2 1 0] - pshufb m5, m3, [r5 + 1*16] ; m5 = [A 9 8 7 9 8 7 6 8 7 6 5 7 6 5 4] - pshufb m3, [r5 + 2*16] ; m3 = [E D C B D C B A C B A 9 B A 9 8] - pmaddubsw m4, m0 - pmaddubsw m6, m5, m1 - pmaddubsw m5, m0 - pmaddubsw m3, m1 - paddw m4, m6 - paddw m5, m3 - phaddw m4, m5 - pmulhrsw m4, m2 - packuswb m4, m4 - movh [r2 + x], m4 -%assign x x+8 -%endrep - - add r0, r1 - add r2, r3 - - dec r4d - jnz .loopH - RET -%endmacro - - IPFILTER_LUMA_PP_W8 8, 4 - IPFILTER_LUMA_PP_W8 8, 8 - IPFILTER_LUMA_PP_W8 8, 16 - IPFILTER_LUMA_PP_W8 8, 32 - IPFILTER_LUMA_PP_W8 16, 4 - IPFILTER_LUMA_PP_W8 16, 8 - IPFILTER_LUMA_PP_W8 16, 12 - IPFILTER_LUMA_PP_W8 16, 16 
- IPFILTER_LUMA_PP_W8 16, 32 - IPFILTER_LUMA_PP_W8 16, 64 - IPFILTER_LUMA_PP_W8 24, 32 - IPFILTER_LUMA_PP_W8 32, 8 - IPFILTER_LUMA_PP_W8 32, 16 - IPFILTER_LUMA_PP_W8 32, 24 - IPFILTER_LUMA_PP_W8 32, 32 - IPFILTER_LUMA_PP_W8 32, 64 - IPFILTER_LUMA_PP_W8 48, 64 - IPFILTER_LUMA_PP_W8 64, 16 - IPFILTER_LUMA_PP_W8 64, 32 - IPFILTER_LUMA_PP_W8 64, 48 - IPFILTER_LUMA_PP_W8 64, 64 - -;---------------------------------------------------------------------------------------------------------------------------- -; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;---------------------------------------------------------------------------------------------------------------------------- - IPFILTER_LUMA 4, 4, ps - IPFILTER_LUMA 8, 8, ps - IPFILTER_LUMA 8, 4, ps - IPFILTER_LUMA 4, 8, ps - IPFILTER_LUMA 16, 16, ps - IPFILTER_LUMA 16, 8, ps - IPFILTER_LUMA 8, 16, ps - IPFILTER_LUMA 16, 12, ps - IPFILTER_LUMA 12, 16, ps - IPFILTER_LUMA 16, 4, ps - IPFILTER_LUMA 4, 16, ps - IPFILTER_LUMA 32, 32, ps - IPFILTER_LUMA 32, 16, ps - IPFILTER_LUMA 16, 32, ps - IPFILTER_LUMA 32, 24, ps - IPFILTER_LUMA 24, 32, ps - IPFILTER_LUMA 32, 8, ps - IPFILTER_LUMA 8, 32, ps - IPFILTER_LUMA 64, 64, ps - IPFILTER_LUMA 64, 32, ps - IPFILTER_LUMA 32, 64, ps - IPFILTER_LUMA 64, 48, ps - IPFILTER_LUMA 48, 64, ps - IPFILTER_LUMA 64, 16, ps - IPFILTER_LUMA 16, 64, ps - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_2x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -%macro FILTER_HORIZ_CHROMA_2xN 2 -INIT_XMM sse4 -cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 4, src, srcstride, dst, dststride -%define coef2 m3 -%define Tm0 m2 -%define t1 m1 -%define t0 m0 - - 
dec srcq - mov r4d, r4m - add dststrided, dststrided - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - movd coef2, [r6 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t1, [pw_2000] - mova Tm0, [h_tab_Tm] - - mov r4d, %2 - cmp r5m, byte 0 - je .loopH - sub srcq, srcstrideq - add r4d, 3 - -.loopH: - movh t0, [srcq] - pshufb t0, t0, Tm0 - pmaddubsw t0, coef2 - phaddw t0, t0 - psubw t0, t1 - movd [dstq], t0 - - lea srcq, [srcq + srcstrideq] - lea dstq, [dstq + dststrideq] - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_HORIZ_CHROMA_2xN 2, 4 - FILTER_HORIZ_CHROMA_2xN 2, 8 - - FILTER_HORIZ_CHROMA_2xN 2, 16 - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_4x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -%macro FILTER_HORIZ_CHROMA_4xN 2 -INIT_XMM sse4 -cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 4, src, srcstride, dst, dststride -%define coef2 m3 -%define Tm0 m2 -%define t1 m1 -%define t0 m0 - - dec srcq - mov r4d, r4m - add dststrided, dststrided - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - movd coef2, [r6 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t1, [pw_2000] - mova Tm0, [h_tab_Tm] - - mov r4d, %2 - cmp r5m, byte 0 - je .loopH - sub srcq, srcstrideq - add r4d, 3 - -.loopH: - movh t0, [srcq] - pshufb t0, t0, Tm0 - pmaddubsw t0, coef2 - phaddw t0, t0 - psubw t0, t1 - movlps [dstq], t0 - - lea srcq, [srcq + srcstrideq] - lea dstq, [dstq + dststrideq] - - dec r4d - jnz .loopH - RET -%endmacro - - FILTER_HORIZ_CHROMA_4xN 4, 2 - FILTER_HORIZ_CHROMA_4xN 4, 4 - FILTER_HORIZ_CHROMA_4xN 4, 8 - FILTER_HORIZ_CHROMA_4xN 4, 16 - - FILTER_HORIZ_CHROMA_4xN 4, 32 - -%macro 
PROCESS_CHROMA_W6 3 - movu %1, [srcq] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - psubw %2, %3 - movh [dstq], %2 - pshufd %2, %2, 2 - movd [dstq + 8], %2 -%endmacro - -%macro PROCESS_CHROMA_W12 3 - movu %1, [srcq] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - psubw %2, %3 - movu [dstq], %2 - movu %1, [srcq + 8] - pshufb %1, %1, Tm0 - pmaddubsw %1, coef2 - phaddw %1, %1 - psubw %1, %3 - movh [dstq + 16], %1 -%endmacro - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_6x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -%macro FILTER_HORIZ_CHROMA 2 -INIT_XMM sse4 -cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride -%define coef2 m5 -%define Tm0 m4 -%define Tm1 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - dec srcq - mov r4d, r4m - add dststrided, dststrided - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - movd coef2, [r6 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_2000] - mova Tm0, [h_tab_Tm] - mova Tm1, [h_tab_Tm + 16] - - mov r4d, %2 - cmp r5m, byte 0 - je .loopH - sub srcq, srcstrideq - add r4d, 3 - -.loopH: - PROCESS_CHROMA_W%1 t0, t1, t2 - add srcq, srcstrideq - add dstq, dststrideq - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_HORIZ_CHROMA 6, 8 - FILTER_HORIZ_CHROMA 12, 16 - - FILTER_HORIZ_CHROMA 6, 16 - FILTER_HORIZ_CHROMA 12, 32 - -%macro PROCESS_CHROMA_W8 3 - movu %1, [srcq] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - psubw %2, %3 - movu [dstq], %2 -%endmacro - 
-;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -%macro FILTER_HORIZ_CHROMA_8xN 2 -INIT_XMM sse4 -cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride -%define coef2 m5 -%define Tm0 m4 -%define Tm1 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - dec srcq - mov r4d, r4m - add dststrided, dststrided - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - movd coef2, [r6 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_2000] - mova Tm0, [h_tab_Tm] - mova Tm1, [h_tab_Tm + 16] - - mov r4d, %2 - cmp r5m, byte 0 - je .loopH - sub srcq, srcstrideq - add r4d, 3 - -.loopH: - PROCESS_CHROMA_W8 t0, t1, t2 - add srcq, srcstrideq - add dstq, dststrideq - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_HORIZ_CHROMA_8xN 8, 2 - FILTER_HORIZ_CHROMA_8xN 8, 4 - FILTER_HORIZ_CHROMA_8xN 8, 6 - FILTER_HORIZ_CHROMA_8xN 8, 8 - FILTER_HORIZ_CHROMA_8xN 8, 16 - FILTER_HORIZ_CHROMA_8xN 8, 32 - - FILTER_HORIZ_CHROMA_8xN 8, 12 - FILTER_HORIZ_CHROMA_8xN 8, 64 - -%macro PROCESS_CHROMA_W16 4 - movu %1, [srcq] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq + 8] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - psubw %2, %3 - psubw %4, %3 - movu [dstq], %2 - movu [dstq + 16], %4 -%endmacro - -%macro PROCESS_CHROMA_W24 4 - movu %1, [srcq] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq + 8] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - 
psubw %2, %3 - psubw %4, %3 - movu [dstq], %2 - movu [dstq + 16], %4 - movu %1, [srcq + 16] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - psubw %2, %3 - movu [dstq + 32], %2 -%endmacro - -%macro PROCESS_CHROMA_W32 4 - movu %1, [srcq] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq + 8] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - psubw %2, %3 - psubw %4, %3 - movu [dstq], %2 - movu [dstq + 16], %4 - movu %1, [srcq + 16] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq + 24] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - psubw %2, %3 - psubw %4, %3 - movu [dstq + 32], %2 - movu [dstq + 48], %4 -%endmacro - -%macro PROCESS_CHROMA_W16o 5 - movu %1, [srcq + %5] - pshufb %2, %1, Tm0 - pmaddubsw %2, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %2, %1 - movu %1, [srcq + %5 + 8] - pshufb %4, %1, Tm0 - pmaddubsw %4, coef2 - pshufb %1, %1, Tm1 - pmaddubsw %1, coef2 - phaddw %4, %1 - psubw %2, %3 - psubw %4, %3 - movu [dstq + %5 * 2], %2 - movu [dstq + %5 * 2 + 16], %4 -%endmacro - -%macro PROCESS_CHROMA_W48 4 - PROCESS_CHROMA_W16o %1, %2, %3, %4, 0 - PROCESS_CHROMA_W16o %1, %2, %3, %4, 16 - PROCESS_CHROMA_W16o %1, %2, %3, %4, 32 -%endmacro - -%macro PROCESS_CHROMA_W64 4 - PROCESS_CHROMA_W16o %1, %2, %3, %4, 0 - PROCESS_CHROMA_W16o %1, %2, %3, %4, 16 - PROCESS_CHROMA_W16o %1, %2, %3, %4, 32 - PROCESS_CHROMA_W16o %1, %2, %3, %4, 48 -%endmacro - -;------------------------------------------------------------------------------------------------------------------------------ -; void interp_4tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) 
-;------------------------------------------------------------------------------------------------------------------------------ -%macro FILTER_HORIZ_CHROMA_WxN 2 -INIT_XMM sse4 -cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 7, src, srcstride, dst, dststride -%define coef2 m6 -%define Tm0 m5 -%define Tm1 m4 -%define t3 m3 -%define t2 m2 -%define t1 m1 -%define t0 m0 - - dec srcq - mov r4d, r4m - add dststrided, dststrided - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - movd coef2, [r6 + r4 * 4] -%else - movd coef2, [h_tab_ChromaCoeff + r4 * 4] -%endif - - pshufd coef2, coef2, 0 - mova t2, [pw_2000] - mova Tm0, [h_tab_Tm] - mova Tm1, [h_tab_Tm + 16] - - mov r4d, %2 - cmp r5m, byte 0 - je .loopH - sub srcq, srcstrideq - add r4d, 3 - -.loopH: - PROCESS_CHROMA_W%1 t0, t1, t2, t3 - add srcq, srcstrideq - add dstq, dststrideq - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_HORIZ_CHROMA_WxN 16, 4 - FILTER_HORIZ_CHROMA_WxN 16, 8 - FILTER_HORIZ_CHROMA_WxN 16, 12 - FILTER_HORIZ_CHROMA_WxN 16, 16 - FILTER_HORIZ_CHROMA_WxN 16, 32 - FILTER_HORIZ_CHROMA_WxN 24, 32 - FILTER_HORIZ_CHROMA_WxN 32, 8 - FILTER_HORIZ_CHROMA_WxN 32, 16 - FILTER_HORIZ_CHROMA_WxN 32, 24 - FILTER_HORIZ_CHROMA_WxN 32, 32 - - FILTER_HORIZ_CHROMA_WxN 16, 24 - FILTER_HORIZ_CHROMA_WxN 16, 64 - FILTER_HORIZ_CHROMA_WxN 24, 64 - FILTER_HORIZ_CHROMA_WxN 32, 48 - FILTER_HORIZ_CHROMA_WxN 32, 64 - - FILTER_HORIZ_CHROMA_WxN 64, 64 - FILTER_HORIZ_CHROMA_WxN 64, 32 - FILTER_HORIZ_CHROMA_WxN 64, 48 - FILTER_HORIZ_CHROMA_WxN 48, 64 - FILTER_HORIZ_CHROMA_WxN 64, 16 - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_32x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;-----------------------------------------------------------------------------------------------------------------------------; -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_32x32, 4,6,8 - mov r4d, r4m - add 
r3d, r3d - dec r0 - - ; check isRowExt - cmp r5m, byte 0 - - lea r5, [h_tab_ChromaCoeff] - vpbroadcastw m0, [r5 + r4 * 4 + 0] - vpbroadcastw m1, [r5 + r4 * 4 + 2] - mova m7, [pw_2000] - - ; register map - ; m0 - interpolate coeff Low - ; m1 - interpolate coeff High - ; m7 - constant pw_2000 - mov r4d, 32 - je .loop - sub r0, r1 - add r4d, 3 - -.loop: - ; Row 0 - movu m2, [r0] - movu m3, [r0 + 1] - punpckhbw m4, m2, m3 - punpcklbw m2, m3 - pmaddubsw m4, m0 - pmaddubsw m2, m0 - - movu m3, [r0 + 2] - movu m5, [r0 + 3] - punpckhbw m6, m3, m5 - punpcklbw m3, m5 - pmaddubsw m6, m1 - pmaddubsw m3, m1 - - paddw m4, m6 - paddw m2, m3 - psubw m4, m7 - psubw m2, m7 - vperm2i128 m3, m2, m4, 0x20 - vperm2i128 m5, m2, m4, 0x31 - movu [r2], m3 - movu [r2 + mmsize], m5 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_16x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;-----------------------------------------------------------------------------------------------------------------------------; -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_16x16, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - mov r6d, 16 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 3 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 8] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu 
[r2], m3 - - add r2, r3 - add r0, r1 - dec r6d - jnz .loop - RET - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_16xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_PS_16xN_AVX2 2 -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - mov r6d, %2 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 3 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 8] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - - vpermq m3, m3, 11011000b - movu [r2], m3 - - add r2, r3 - add r0, r1 - dec r6d - jnz .loop - RET -%endmacro - - IPFILTER_CHROMA_PS_16xN_AVX2 16 , 32 - IPFILTER_CHROMA_PS_16xN_AVX2 16 , 12 - IPFILTER_CHROMA_PS_16xN_AVX2 16 , 8 - IPFILTER_CHROMA_PS_16xN_AVX2 16 , 4 - IPFILTER_CHROMA_PS_16xN_AVX2 16 , 24 - IPFILTER_CHROMA_PS_16xN_AVX2 16 , 64 - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_32xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_PS_32xN_AVX2 2 -INIT_YMM avx2 
-cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - mov r6d, %2 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 3 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 8] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - - vpermq m3, m3, 11011000b - movu [r2], m3 - - vbroadcasti128 m3, [r0 + 16] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 24] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - - vpermq m3, m3, 11011000b - movu [r2 + 32], m3 - - add r2, r3 - add r0, r1 - dec r6d - jnz .loop - RET -%endmacro - - IPFILTER_CHROMA_PS_32xN_AVX2 32 , 16 - IPFILTER_CHROMA_PS_32xN_AVX2 32 , 24 - IPFILTER_CHROMA_PS_32xN_AVX2 32 , 8 - IPFILTER_CHROMA_PS_32xN_AVX2 32 , 64 - IPFILTER_CHROMA_PS_32xN_AVX2 32 , 48 - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_4x4, 4,7,5 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff 
- ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - test r5d, r5d - je .label - sub r0 , r1 - -.label: - ; Row 0-1 - movu xm3, [r0] - vinserti128 m3, m3, [r0 + r1], 1 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 2-3 - lea r0, [r0 + r1 * 2] - movu xm4, [r0] - vinserti128 m4, m4, [r0 + r1], 1 - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, [pw_2000] - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movq [r2+r3], xm4 - lea r2, [r2 + r3 * 2] - movhps [r2], xm3 - movhps [r2 + r3], xm4 - - test r5d, r5d - jz .end - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - - ;Row 5-6 - movu xm3, [r0] - vinserti128 m3, m3, [r0 + r1], 1 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 7 - lea r0, [r0 + r1 * 2] - vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, [pw_2000] - - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movq [r2+r3], xm4 - lea r2, [r2 + r3 * 2] - movhps [r2], xm3 -.end: - RET - -cglobal interp_4tap_horiz_ps_4x2, 4,7,5 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - test r5d, r5d - je .label - sub r0 , r1 - -.label: - ; Row 0-1 - movu xm3, [r0] - vinserti128 m3, m3, [r0 + r1], 1 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - packssdw m3, m3 - psubw m3, [pw_2000] - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movq [r2+r3], xm4 - - test r5d, r5d - jz .end - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - - ;Row 2-3 - movu xm3, [r0] - vinserti128 m3, m3, [r0 + r1], 1 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 5 - lea r0, [r0 + r1 * 2] - vbroadcasti128 m4, [r0] ; [x x x x x A 
9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, [pw_2000] - - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movq [r2+r3], xm4 - lea r2, [r2 + r3 * 2] - movhps [r2], xm3 -.end: - RET - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;-----------------------------------------------------------------------------------------------------------------------------; -%macro IPFILTER_CHROMA_PS_4xN_AVX2 2 -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_%1x%2, 4,7,5 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - mov r4, %2 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - - -.loop: - sub r4d, 4 - ; Row 0-1 - movu xm3, [r0] - vinserti128 m3, m3, [r0 + r1], 1 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 2-3 - lea r0, [r0 + r1 * 2] - movu xm4, [r0] - vinserti128 m4, m4, [r0 + r1], 1 - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, [pw_2000] - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movq [r2+r3], xm4 - lea r2, [r2 + r3 * 2] - movhps [r2], xm3 - movhps [r2 + r3], xm4 - - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - - test r4d, r4d - jnz .loop - test r5d, r5d - jz .end - - ;Row 5-6 - movu xm3, [r0] - vinserti128 m3, m3, [r0 + r1], 1 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 7 - lea r0, [r0 + r1 * 2] - vbroadcasti128 m4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, [pw_2000] 
- - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movq [r2+r3], xm4 - lea r2, [r2 + r3 * 2] - movhps [r2], xm3 -.end: - RET -%endmacro - - IPFILTER_CHROMA_PS_4xN_AVX2 4 , 8 - IPFILTER_CHROMA_PS_4xN_AVX2 4 , 16 - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_8x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;-----------------------------------------------------------------------------------------------------------------------------; -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_8x8, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - mov r6d, 4 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 1 - -.loop: - dec r6d - ; Row 0 - vbroadcasti128 m3, [r0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - - vpermq m3, m3, 11011000b - vextracti128 xm4, m3, 1 - movu [r2], xm3 - movu [r2 + r3], xm4 - - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - test r6d, r6d - jnz .loop - test r5d, r5d - je .end - - ;Row 11 - vbroadcasti128 m3, [r0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - packssdw m3, m3 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2], xm3 -.end: - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_4x2, 4,6,4 - mov r4d, r4m -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate 
coeff - ; m1 - shuffle order table - - ; Row 0-1 - movu xm2, [r0 - 1] - vinserti128 m2, m2, [r0 + r1 - 1], 1 - pshufb m2, m1 - pmaddubsw m2, m0 - pmaddwd m2, [pw_1] - - packssdw m2, m2 - pmulhrsw m2, [pw_512] - vextracti128 xm3, m2, 1 - packuswb xm2, xm3 - - movd [r2], xm2 - pextrd [r2+r3], xm2, 2 - RET - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_32xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_PP_32xN_AVX2 2 -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_%1x%2, 4,6,7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - mova m6, [pw_512] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, %2 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - vbroadcasti128 m4, [r0 + 16] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + 20] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, 11011000b - - movu [r2], m3 - add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET -%endmacro - - IPFILTER_CHROMA_PP_32xN_AVX2 32, 16 - IPFILTER_CHROMA_PP_32xN_AVX2 32, 24 - IPFILTER_CHROMA_PP_32xN_AVX2 32, 8 - IPFILTER_CHROMA_PP_32xN_AVX2 32, 64 - IPFILTER_CHROMA_PP_32xN_AVX2 32, 48 - -;------------------------------------------------------------------------------------------------------------- -; void 
interp_4tap_horiz_pp_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_PP_8xN_AVX2 2 -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_%1x%2, 4,6,6 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - movu m1, [h_tab_Tm] - vpbroadcastd m2, [pw_1] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - sub r0, 1 - mov r4d, %2 - -.loop: - sub r4d, 4 - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, [pw_512] - lea r0, [r0 + r1 * 2] - - ; Row 2 - vbroadcasti128 m4, [r0 ] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - ; Row 3 - vbroadcasti128 m5, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, [pw_512] - - packuswb m3, m4 - mova m5, [interp_4tap_8x8_horiz_shuf] - vpermd m3, m5, m3 - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movhps [r2 + r3], xm3 - lea r2, [r2 + r3 * 2] - movq [r2], xm4 - movhps [r2 + r3], xm4 - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1*2] - test r4d, r4d - jnz .loop - RET -%endmacro - - IPFILTER_CHROMA_PP_8xN_AVX2 8 , 16 - IPFILTER_CHROMA_PP_8xN_AVX2 8 , 32 - IPFILTER_CHROMA_PP_8xN_AVX2 8 , 4 - IPFILTER_CHROMA_PP_8xN_AVX2 8 , 64 - IPFILTER_CHROMA_PP_8xN_AVX2 8 , 12 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int 
coeffIdx -;------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_PP_4xN_AVX2 2 -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_%1x%2, 4,6,6 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vpbroadcastd m2, [pw_1] - vbroadcasti128 m1, [h_tab_Tm] - mov r4d, %2 - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - -.loop: - sub r4d, 4 - ; Row 0-1 - movu xm3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - vinserti128 m3, m3, [r0 + r1], 1 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 2-3 - lea r0, [r0 + r1 * 2] - movu xm4, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - vinserti128 m4, m4, [r0 + r1], 1 - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - pmulhrsw m3, [pw_512] - vextracti128 xm4, m3, 1 - packuswb xm3, xm4 - - movd [r2], xm3 - pextrd [r2+r3], xm3, 2 - lea r2, [r2 + r3 * 2] - pextrd [r2], xm3, 1 - pextrd [r2+r3], xm3, 3 - - lea r0, [r0 + r1 * 2] - lea r2, [r2 + r3 * 2] - test r4d, r4d - jnz .loop - RET -%endmacro - - IPFILTER_CHROMA_PP_4xN_AVX2 4 , 8 - IPFILTER_CHROMA_PP_4xN_AVX2 4 , 16 - -%macro IPFILTER_LUMA_PS_32xN_AVX2 2 -INIT_YMM avx2 -cglobal interp_8tap_horiz_ps_%1x%2, 4, 7, 8 - mov r5d, r5m - mov r4d, r4m -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - mova m6, [h_tab_Lm + 32] - mova m1, [h_tab_Lm] - mov r4d, %2 ;height - add r3d, r3d - vbroadcasti128 m2, [pw_1] - mova m7, [h_interp8_hps_shuf] - - ; register map - ; m0 - interpolate coeff - ; m1 , m6 - shuffle order table - ; m2 - pw_1 - - - sub r0, 3 - test r5d, r5d - jz .label - lea r6, [r1 * 3] ; r8 = (N / 2 - 1) * srcStride - sub r0, r6 - add r4d, 7 - -.label: - lea r6, [pw_2000] -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 
7 6 5 4 3 2 1 0] - pshufb m4, m3, m6 ; row 0 (col 4 to 7) - pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - - - vbroadcasti128 m4, [r0 + 8] - pshufb m5, m4, m6 ;row 0 (col 12 to 15) - pshufb m4, m1 ;row 0 (col 8 to 11) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m2 - pmaddwd m5, m2 - packssdw m4, m5 - - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - vpermd m3, m7, m3 - psubw m3, [r6] - - movu [r2], m3 ;row 0 - - vbroadcasti128 m3, [r0 + 16] - pshufb m4, m3, m6 ; row 0 (col 20 to 23) - pshufb m3, m1 ; row 0 (col 16 to 19) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 24] - pshufb m5, m4, m6 ;row 0 (col 28 to 31) - pshufb m4, m1 ;row 0 (col 24 to 27) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m2 - pmaddwd m5, m2 - packssdw m4, m5 - - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - vpermd m3, m7, m3 - psubw m3, [r6] - - movu [r2 + 32], m3 ;row 0 - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET -%endmacro - - IPFILTER_LUMA_PS_32xN_AVX2 32 , 32 - IPFILTER_LUMA_PS_32xN_AVX2 32 , 16 - IPFILTER_LUMA_PS_32xN_AVX2 32 , 24 - IPFILTER_LUMA_PS_32xN_AVX2 32 , 8 - IPFILTER_LUMA_PS_32xN_AVX2 32 , 64 - -INIT_YMM avx2 -cglobal interp_8tap_horiz_ps_48x64, 4, 7, 8 - mov r5d, r5m - mov r4d, r4m -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - mova m6, [h_tab_Lm + 32] - mova m1, [h_tab_Lm] - mov r4d, 64 ;height - add r3d, r3d - vbroadcasti128 m2, [pw_2000] - mova m7, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 , m6 - shuffle order table - ; m2 - pw_2000 - - sub r0, 3 - test r5d, r5d - jz .label - lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride - sub r0, r6 ; r0(src)-r6 - add r4d, 7 ; blkheight += N - 1 (7 - 1 = 6 ; since the last one row not in loop) - -.label: - 
lea r6, [h_interp8_hps_shuf] -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, m6 ; row 0 (col 4 to 7) - pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 8] - pshufb m5, m4, m6 ;row 0 (col 12 to 15) - pshufb m4, m1 ;row 0 (col 8 to 11) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m7 - pmaddwd m5, m7 - packssdw m4, m5 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - mova m5, [r6] - vpermd m3, m5, m3 - psubw m3, m2 - movu [r2], m3 ;row 0 - - vbroadcasti128 m3, [r0 + 16] - pshufb m4, m3, m6 ; row 0 (col 20 to 23) - pshufb m3, m1 ; row 0 (col 16 to 19) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 24] - pshufb m5, m4, m6 ;row 0 (col 28 to 31) - pshufb m4, m1 ;row 0 (col 24 to 27) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m7 - pmaddwd m5, m7 - packssdw m4, m5 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - mova m5, [r6] - vpermd m3, m5, m3 - psubw m3, m2 - movu [r2 + 32], m3 ;row 0 - - vbroadcasti128 m3, [r0 + 32] - pshufb m4, m3, m6 ; row 0 (col 36 to 39) - pshufb m3, m1 ; row 0 (col 32 to 35) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 40] - pshufb m5, m4, m6 ;row 0 (col 44 to 47) - pshufb m4, m1 ;row 0 (col 40 to 43) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m7 - pmaddwd m5, m7 - packssdw m4, m5 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - mova m5, [r6] - vpermd m3, m5, m3 - psubw m3, m2 - movu [r2 + 64], m3 ;row 0 - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET - -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_24x32, 4,6,8 - sub r0, 3 - mov r4d, r4m -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastd m0, [r5 + r4 * 8] - vpbroadcastd m1, [r5 + r4 * 8 + 4] -%else - vpbroadcastd m0, 
[h_tab_LumaCoeff + r4 * 8] - vpbroadcastd m1, [h_tab_LumaCoeff + r4 * 8 + 4] -%endif - movu m3, [h_tab_Tm + 16] - vpbroadcastd m7, [pw_1] - lea r5, [h_tab_Tm] - - ; register map - ; m0 , m1 interpolate coeff - ; m2 , m2 shuffle order table - ; m7 - pw_1 - - mov r4d, 32 -.loop: - ; Row 0 - vbroadcasti128 m4, [r0] ; [x E D C B A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m3 - pshufb m4, [r5] - pmaddubsw m4, m0 - pmaddubsw m5, m1 - paddw m4, m5 - pmaddwd m4, m7 - - vbroadcasti128 m5, [r0 + 8] - pshufb m6, m5, m3 - pshufb m5, [r5] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] - pmulhrsw m4, [pw_512] - - vbroadcasti128 m2, [r0 + 16] - pshufb m5, m2, m3 - pshufb m2, [r5] - pmaddubsw m2, m0 - pmaddubsw m5, m1 - paddw m2, m5 - pmaddwd m2, m7 - - packssdw m2, m2 - pmulhrsw m2, [pw_512] - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm5, m4, 1 - pshufd xm4, xm4, 11011000b - pshufd xm5, xm5, 11011000b - - movu [r2], xm4 - movq [r2 + 16], xm5 - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET - -INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_12x16, 4,6,8 - sub r0, 3 - mov r4d, r4m -%ifdef PIC - lea r5, [h_tab_LumaCoeff] - vpbroadcastd m0, [r5 + r4 * 8] - vpbroadcastd m1, [r5 + r4 * 8 + 4] -%else - vpbroadcastd m0, [h_tab_LumaCoeff + r4 * 8] - vpbroadcastd m1, [h_tab_LumaCoeff + r4 * 8 + 4] -%endif - movu m3, [h_tab_Tm + 16] - vpbroadcastd m7, [pw_1] - lea r5, [h_tab_Tm] - - ; register map - ; m0 , m1 interpolate coeff - ; m2 , m2 shuffle order table - ; m7 - pw_1 - - mov r4d, 8 -.loop: - ; Row 0 - vbroadcasti128 m4, [r0] ;first 8 element - pshufb m5, m4, m3 - pshufb m4, [r5] - pmaddubsw m4, m0 - pmaddubsw m5, m1 - paddw m4, m5 - pmaddwd m4, m7 - - vbroadcasti128 m5, [r0 + 8] ; element 8 to 11 - pshufb m6, m5, m3 - pshufb m5, [r5] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - - packssdw m4, m5 ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00] - 
pmulhrsw m4, [pw_512] - - ;Row 1 - vbroadcasti128 m2, [r0 + r1] - pshufb m5, m2, m3 - pshufb m2, [r5] - pmaddubsw m2, m0 - pmaddubsw m5, m1 - paddw m2, m5 - pmaddwd m2, m7 - - vbroadcasti128 m5, [r0 + r1 + 8] - pshufb m6, m5, m3 - pshufb m5, [r5] - pmaddubsw m5, m0 - pmaddubsw m6, m1 - paddw m5, m6 - pmaddwd m5, m7 - - packssdw m2, m5 - pmulhrsw m2, [pw_512] - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm5, m4, 1 - pshufd xm4, xm4, 11011000b - pshufd xm5, xm5, 11011000b - - movq [r2], xm4 - pextrd [r2+8], xm4, 2 - movq [r2 + r3], xm5 - pextrd [r2+r3+8], xm5, 2 - lea r0, [r0 + r1 * 2] - lea r2, [r2 + r3 * 2] - dec r4d - jnz .loop - RET - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_16xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_PP_16xN_AVX2 2 -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m6, [pw_512] - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, %2/2 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, 
m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, 11011000b - - vextracti128 xm4, m3, 1 - movu [r2], xm3 - movu [r2 + r3], xm4 - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - dec r4d - jnz .loop - RET -%endmacro - - IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8 - IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32 - IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12 - IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4 - IPFILTER_CHROMA_PP_16xN_AVX2 16 , 64 - IPFILTER_CHROMA_PP_16xN_AVX2 16 , 24 - -%macro IPFILTER_LUMA_PS_64xN_AVX2 1 -INIT_YMM avx2 -cglobal interp_8tap_horiz_ps_64x%1, 4, 7, 8 - mov r5d, r5m - mov r4d, r4m -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - mova m6, [h_tab_Lm + 32] - mova m1, [h_tab_Lm] - mov r4d, %1 ;height - add r3d, r3d - vbroadcasti128 m2, [pw_1] - mova m7, [h_interp8_hps_shuf] - - ; register map - ; m0 - interpolate coeff - ; m1 , m6 - shuffle order table - ; m2 - pw_2000 - - sub r0, 3 - test r5d, r5d - jz .label - lea r6, [r1 * 3] - sub r0, r6 ; r0(src)-r6 - add r4d, 7 ; blkheight += N - 1 - -.label: - lea r6, [pw_2000] -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, m6 ; row 0 (col 4 to 7) - pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 8] - pshufb m5, m4, m6 ;row 0 (col 12 to 15) - pshufb m4, m1 ;row 0 (col 8 to 11) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m2 - pmaddwd m5, m2 - packssdw m4, m5 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - vpermd m3, m7, m3 - psubw m3, [r6] - movu [r2], m3 ;row 0 - - vbroadcasti128 m3, [r0 + 16] - pshufb m4, m3, m6 ; row 0 (col 20 to 23) - pshufb m3, m1 ; row 0 (col 16 to 19) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 24] - pshufb m5, m4, m6 ;row 0 
(col 28 to 31) - pshufb m4, m1 ;row 0 (col 24 to 27) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m2 - pmaddwd m5, m2 - packssdw m4, m5 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - vpermd m3, m7, m3 - psubw m3, [r6] - movu [r2 + 32], m3 ;row 0 - - vbroadcasti128 m3, [r0 + 32] - pshufb m4, m3, m6 ; row 0 (col 36 to 39) - pshufb m3, m1 ; row 0 (col 32 to 35) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 40] - pshufb m5, m4, m6 ;row 0 (col 44 to 47) - pshufb m4, m1 ;row 0 (col 40 to 43) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m2 - pmaddwd m5, m2 - packssdw m4, m5 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - vpermd m3, m7, m3 - psubw m3, [r6] - movu [r2 + 64], m3 ;row 0 - vbroadcasti128 m3, [r0 + 48] - pshufb m4, m3, m6 ; row 0 (col 52 to 55) - pshufb m3, m1 ; row 0 (col 48 to 51) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 56] - pshufb m5, m4, m6 ;row 0 (col 60 to 63) - pshufb m4, m1 ;row 0 (col 56 to 59) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m2 - pmaddwd m5, m2 - packssdw m4, m5 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 - vpermd m3, m7, m3 - psubw m3, [r6] - movu [r2 + 96], m3 ;row 0 - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET -%endmacro - - IPFILTER_LUMA_PS_64xN_AVX2 64 - IPFILTER_LUMA_PS_64xN_AVX2 48 - IPFILTER_LUMA_PS_64xN_AVX2 32 - IPFILTER_LUMA_PS_64xN_AVX2 16 - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_PS_8xN_AVX2 1 -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_8x%1, 4,7,6 - mov r4d, r4m 
- mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - mov r6d, %1/2 - dec r0 - test r5d, r5d - jz .loop - sub r0 , r1 - inc r6d - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - vextracti128 xm4, m3, 1 - movu [r2], xm3 - movu [r2 + r3], xm4 - - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - dec r6d - jnz .loop - test r5d, r5d - jz .end - - ;Row 11 - vbroadcasti128 m3, [r0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - packssdw m3, m3 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2], xm3 -.end: - RET -%endmacro - - IPFILTER_CHROMA_PS_8xN_AVX2 2 - IPFILTER_CHROMA_PS_8xN_AVX2 32 - IPFILTER_CHROMA_PS_8xN_AVX2 16 - IPFILTER_CHROMA_PS_8xN_AVX2 6 - IPFILTER_CHROMA_PS_8xN_AVX2 4 - IPFILTER_CHROMA_PS_8xN_AVX2 12 - IPFILTER_CHROMA_PS_8xN_AVX2 64 - -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_2x4, 4, 7, 3 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova xm3, [pw_2000] - dec r0 - test r5d, r5d - jz .label - sub r0, r1 - -.label: - lea r6, [r1 * 3] - movq xm1, [r0] - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r6] - - vinserti128 m1, m1, xm2, 1 - pshufb m1, [interp4_hpp_shuf] - pmaddubsw m1, m0 - pmaddwd m1, [pw_1] - vextracti128 xm2, m1, 1 - packssdw xm1, xm2 - psubw xm1, xm3 - - lea r4, [r3 * 3] - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - pextrd [r2 + r3 * 2], xm1, 2 - pextrd [r2 + r4], xm1, 3 - - 
test r5d, r5d - jz .end - lea r2, [r2 + r3 * 4] - lea r0, [r0 + r1 * 4] - - movq xm1, [r0] - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - vinserti128 m1, m1, xm2, 1 - pshufb m1, [interp4_hpp_shuf] - pmaddubsw m1, m0 - pmaddwd m1, [pw_1] - vextracti128 xm2, m1, 1 - packssdw xm1, xm2 - psubw xm1, xm3 - - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - pextrd [r2 + r3 * 2], xm1, 2 -.end: - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_2x8, 4, 7, 7 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - vbroadcasti128 m6, [pw_2000] - test r5d, r5d - jz .label - sub r0, r1 - -.label: - mova m4, [interp4_hpp_shuf] - mova m5, [pw_1] - dec r0 - lea r4, [r1 * 3] - movq xm1, [r0] ;row 0 - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m1, m1, xm2, 1 - lea r0, [r0 + r1 * 4] - movq xm3, [r0] - movhps xm3, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m3, m3, xm2, 1 - - pshufb m1, m4 - pshufb m3, m4 - pmaddubsw m1, m0 - pmaddubsw m3, m0 - pmaddwd m1, m5 - pmaddwd m3, m5 - packssdw m1, m3 - psubw m1, m6 - - lea r4, [r3 * 3] - vextracti128 xm2, m1, 1 - - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - movd [r2 + r3 * 2], xm2 - pextrd [r2 + r4], xm2, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm1, 2 - pextrd [r2 + r3], xm1, 3 - pextrd [r2 + r3 * 2], xm2, 2 - pextrd [r2 + r4], xm2, 3 - test r5d, r5d - jz .end - - lea r0, [r0 + r1 * 4] - lea r2, [r2 + r3 * 4] - movq xm1, [r0] ;row 0 - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - vinserti128 m1, m1, xm2, 1 - pshufb m1, m4 - pmaddubsw m1, m0 - pmaddwd m1, m5 - packssdw m1, m1 - psubw m1, m6 - vextracti128 xm2, m1, 1 - - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - movd [r2 + r3 * 2], xm2 -.end: - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_12x16, 4, 6, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd 
m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m6, [pw_512] - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, 8 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, 11011000b - - vextracti128 xm4, m3, 1 - movq [r2], xm3 - pextrd [r2+8], xm3, 2 - movq [r2 + r3], xm4 - pextrd [r2 + r3 + 8],xm4, 2 - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - dec r4d - jnz .loop - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_24x32, 4,6,7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - mova m6, [pw_512] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, 32 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - vbroadcasti128 m4, [r0 + 16] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + 20] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, 
m4 - vpermq m3, m3, 11011000b - - vextracti128 xm4, m3, 1 - movu [r2], xm3 - movq [r2 + 16], xm4 - add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;-----------------------------------------------------------------------------------------------------------------------------; -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_6x8, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - mov r6d, 8/2 - dec r0 - test r5d, r5d - jz .loop - sub r0 , r1 - inc r6d - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - vextracti128 xm4, m3, 1 - movq [r2], xm3 - pextrd [r2 + 8], xm3, 2 - movq [r2 + r3], xm4 - pextrd [r2 + r3 + 8], xm4, 2 - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - dec r6d - jnz .loop - test r5d, r5d - jz .end - - ;Row 11 - vbroadcasti128 m3, [r0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - packssdw m3, m3 - psubw m3, m5 - vextracti128 xm4, m3, 1 - movq [r2], xm3 - movd [r2+8], xm4 -.end: - RET - -INIT_YMM avx2 -cglobal interp_8tap_horiz_ps_12x16, 6, 7, 8 - mov r5d, r5m - mov r4d, r4m -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - mova m6, [h_tab_Lm + 32] - mova m1, [h_tab_Lm] - add r3d, r3d - 
vbroadcasti128 m2, [pw_2000] - mov r4d, 16 - vbroadcasti128 m7, [pw_1] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - pw_2000 - - mova m5, [h_interp8_hps_shuf] - sub r0, 3 - test r5d, r5d - jz .loop - lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride - sub r0, r6 ; r0(src)-r6 - add r4d, 7 -.loop: - - ; Row 0 - - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, m6 - pshufb m3, m1 ; shuffled based on the col order tab_Lm - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m7 - packssdw m4, m4 - - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - - vpermd m3, m5, m3 - psubw m3, m2 - - vextracti128 xm4, m3, 1 - movu [r2], xm3 ;row 0 - movq [r2 + 16], xm4 ;row 1 - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET - -INIT_YMM avx2 -cglobal interp_8tap_horiz_ps_24x32, 4, 7, 8 - mov r5d, r5m - mov r4d, r4m -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - mova m6, [h_tab_Lm + 32] - mova m1, [h_tab_Lm] - mov r4d, 32 ;height - add r3d, r3d - vbroadcasti128 m2, [pw_2000] - vbroadcasti128 m7, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 , m6 - shuffle order table - ; m2 - pw_2000 - - sub r0, 3 - test r5d, r5d - jz .label - lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride - sub r0, r6 ; r0(src)-r6 - add r4d, 7 ; blkheight += N - 1 (7 - 1 = 6 ; since the last one row not in loop) - -.label: - lea r6, [h_interp8_hps_shuf] -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, m6 ; row 0 (col 4 to 7) - pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 
2 1 0] - pshufb m5, m4, m6 ;row 1 (col 4 to 7) - pshufb m4, m1 ;row 1 (col 0 to 3) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m7 - pmaddwd m5, m7 - packssdw m4, m5 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - mova m5, [r6] - vpermd m3, m5, m3 - psubw m3, m2 - movu [r2], m3 ;row 0 - - vbroadcasti128 m3, [r0 + 16] - pshufb m4, m3, m6 - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - pmaddwd m3, m7 - pmaddwd m4, m7 - packssdw m3, m4 - mova m4, [r6] - vpermd m3, m4, m3 - psubw m3, m2 - movu [r2 + 32], xm3 ;row 0 - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_24x32, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - mov r6d, 32 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 3 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2], m3 - - vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - packssdw m3, m3 
- psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2 + 32], xm3 - - add r2, r3 - add r0, r1 - dec r6d - jnz .loop - RET - -;----------------------------------------------------------------------------------------------------------------------- -;macro FILTER_H8_W8_16N_AVX2 -;----------------------------------------------------------------------------------------------------------------------- -%macro FILTER_H8_W8_16N_AVX2 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m3, m6 ; row 0 (col 4 to 7) - pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) - pmaddubsw m3, m0 - pmaddubsw m4, m0 - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] - - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m4, m6 ;row 1 (col 4 to 7) - pshufb m4, m1 ;row 1 (col 0 to 3) - pmaddubsw m4, m0 - pmaddubsw m5, m0 - pmaddwd m4, m2 - pmaddwd m5, m2 - packssdw m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] - - pmaddwd m3, m2 - pmaddwd m4, m2 - packssdw m3, m4 ; all rows and col completed. 
- - mova m5, [h_interp8_hps_shuf] - vpermd m3, m5, m3 - psubw m3, m8 - - vextracti128 xm4, m3, 1 - mova [r4], xm3 - mova [r4 + 16], xm4 - %endmacro - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_PP_64xN_AVX2 1 -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_64x%1, 4,6,7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - mova m6, [pw_512] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, %1 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - vbroadcasti128 m4, [r0 + 16] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + 20] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - packuswb m3, m4 - vpermq m3, m3, 11011000b - movu [r2], m3 - - vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 36] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - vbroadcasti128 m4, [r0 + 48] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + 52] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - packuswb m3, m4 - vpermq m3, m3, 11011000b - movu [r2 + 32], m3 - - add r2, r3 - add r0, r1 - 
dec r4d - jnz .loop - RET -%endmacro - - IPFILTER_CHROMA_PP_64xN_AVX2 64 - IPFILTER_CHROMA_PP_64xN_AVX2 32 - IPFILTER_CHROMA_PP_64xN_AVX2 48 - IPFILTER_CHROMA_PP_64xN_AVX2 16 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_48x64, 4,6,7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - mova m6, [pw_512] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, 64 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - vbroadcasti128 m4, [r0 + 16] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + 20] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, q3120 - - movu [r2], m3 - - vbroadcasti128 m3, [r0 + mmsize] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + mmsize + 4] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - vbroadcasti128 m4, [r0 + mmsize + 16] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + mmsize + 20] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, q3120 - movu [r2 + mmsize], xm3 - - add 
r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET - -;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;-----------------------------------------------------------------------------------------------------------------------------; - -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_48x64, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - mov r6d, 64 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 3 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, q3120 - movu [r2], m3 - - vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, q3120 - movu [r2 + 32], m3 - - vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 40] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, q3120 - movu [r2 + 64], m3 - - add r2, r3 - add r0, r1 - dec r6d - jnz .loop - RET - 
-;----------------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_24x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) -;----------------------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_24x64, 4,7,6 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [h_tab_Tm] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - mov r6d, 64 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 3 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, q3120 - movu [r2], m3 - - vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - packssdw m3, m3 - psubw m3, m5 - vpermq m3, m3, q3120 - movu [r2 + 32], xm3 - - add r2, r3 - add r0, r1 - dec r6d - jnz .loop - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_2x16, 4, 7, 7 - mov r4d, r4m - mov r5d, r5m - add r3d, r3d - -%ifdef PIC - lea r6, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - vbroadcasti128 m6, [pw_2000] - test r5d, r5d - jz .label - sub r0, r1 - -.label: - mova m4, [interp4_hps_shuf] - mova m5, [pw_1] - dec r0 - lea r4, [r1 * 3] - movq xm1, [r0] ;row 0 - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m1, m1, xm2, 1 - 
lea r0, [r0 + r1 * 4] - movq xm3, [r0] - movhps xm3, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m3, m3, xm2, 1 - - pshufb m1, m4 - pshufb m3, m4 - pmaddubsw m1, m0 - pmaddubsw m3, m0 - pmaddwd m1, m5 - pmaddwd m3, m5 - packssdw m1, m3 - psubw m1, m6 - - lea r4, [r3 * 3] - vextracti128 xm2, m1, 1 - - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - movd [r2 + r3 * 2], xm2 - pextrd [r2 + r4], xm2, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm1, 2 - pextrd [r2 + r3], xm1, 3 - pextrd [r2 + r3 * 2], xm2, 2 - pextrd [r2 + r4], xm2, 3 - - lea r0, [r0 + r1 * 4] - lea r2, [r2 + r3 * 4] - lea r4, [r1 * 3] - movq xm1, [r0] - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m1, m1, xm2, 1 - lea r0, [r0 + r1 * 4] - movq xm3, [r0] - movhps xm3, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m3, m3, xm2, 1 - - pshufb m1, m4 - pshufb m3, m4 - pmaddubsw m1, m0 - pmaddubsw m3, m0 - pmaddwd m1, m5 - pmaddwd m3, m5 - packssdw m1, m3 - psubw m1, m6 - - lea r4, [r3 * 3] - vextracti128 xm2, m1, 1 - - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - movd [r2 + r3 * 2], xm2 - pextrd [r2 + r4], xm2, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm1, 2 - pextrd [r2 + r3], xm1, 3 - pextrd [r2 + r3 * 2], xm2, 2 - pextrd [r2 + r4], xm2, 3 - - test r5d, r5d - jz .end - - lea r0, [r0 + r1 * 4] - lea r2, [r2 + r3 * 4] - movq xm1, [r0] - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - vinserti128 m1, m1, xm2, 1 - pshufb m1, m4 - pmaddubsw m1, m0 - pmaddwd m1, m5 - packssdw m1, m1 - psubw m1, m6 - vextracti128 xm2, m1, 1 - - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - movd [r2 + r3 * 2], xm2 -.end: - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_6x16, 4, 6, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m1, [h_tab_Tm] - mova m2, [pw_1] - mova m6, [pw_512] - lea r4, [r1 * 3] - lea r5, [r3 * 3] - ; 
register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 -%rep 4 - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - ; Row 2 - vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - ; Row 3 - vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vextracti128 xm4, m3, 1 - movd [r2], xm3 - pextrw [r2 + 4], xm4, 0 - pextrd [r2 + r3], xm3, 1 - pextrw [r2 + r3 + 4], xm4, 2 - pextrd [r2 + r3 * 2], xm3, 2 - pextrw [r2 + r3 * 2 + 4], xm4, 4 - pextrd [r2 + r5], xm3, 3 - pextrw [r2 + r5 + 4], xm4, 6 - lea r2, [r2 + r3 * 4] - lea r0, [r0 + r1 * 4] -%endrep - RET - -;----------------------------------------------------------------------------- -; void interp_8tap_hv_pp_16x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY) -;----------------------------------------------------------------------------- -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_hv_pp_16x16, 4, 10, 15, 0-31*32 -%define stk_buf1 rsp - mov r4d, r4m - mov r5d, r5m -%ifdef PIC - lea r6, [h_tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [h_tab_LumaCoeff + r4 * 8] -%endif - - xor r6, r6 - mov r4, rsp - mova m6, [h_tab_Lm + 32] - mova m1, [h_tab_Lm] - mov r8, 16 ;height - vbroadcasti128 m8, [pw_2000] - vbroadcasti128 m2, [pw_1] - sub r0, 3 - lea r7, [r1 * 3] ; r7 = (N / 2 - 1) * srcStride - sub r0, r7 ; r0(src)-r7 - add r8, 7 - -.loopH: - FILTER_H8_W8_16N_AVX2 - add r0, r1 - add r4, 32 - inc r6 - cmp r6, 16+7 - jnz .loopH - -; vertical phase - xor r6, r6 - xor r1, r1 
-.loopV: - -;load necessary variables - mov r4d, r5d ;coeff here for vertical is r5m - shl r4d, 7 - mov r1d, 16 - add r1d, r1d - - ; load intermedia buffer - mov r0, stk_buf1 - - ; register mapping - ; r0 - src - ; r5 - coeff - ; r6 - loop_i - -; load coeff table -%ifdef PIC - lea r5, [h_pw_LumaCoeffVer] - add r5, r4 -%else - lea r5, [h_pw_LumaCoeffVer + r4] -%endif - - lea r4, [r1*3] - mova m14, [h_pd_526336] - lea r6, [r3 * 3] - mov r9d, 16 / 8 - -.loopW: - PROCESS_LUMA_AVX2_W8_16R sp - add r2, 8 - add r0, 16 - dec r9d - jnz .loopW - RET -%endif - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_12x32, 4, 6, 7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m6, [pw_512] - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, 16 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - ; Row 1 - vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, 11011000b - - vextracti128 xm4, m3, 1 - movq [r2], xm3 - pextrd [r2+8], xm3, 2 - movq [r2 + r3], xm4 - pextrd [r2 + r3 + 8],xm4, 2 - lea r2, [r2 + r3 * 2] - lea r0, [r0 + r1 * 2] - dec r4d - jnz .loop - RET - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_24x64, 4,6,7 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + 
r4 * 4] -%endif - - mova m1, [interp4_horiz_shuf1] - vpbroadcastd m2, [pw_1] - mova m6, [pw_512] - ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - - dec r0 - mov r4d, 64 - -.loop: - ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 4] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - packssdw m3, m4 - pmulhrsw m3, m6 - - vbroadcasti128 m4, [r0 + 16] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - vbroadcasti128 m5, [r0 + 20] - pshufb m5, m1 - pmaddubsw m5, m0 - pmaddwd m5, m2 - packssdw m4, m5 - pmulhrsw m4, m6 - - packuswb m3, m4 - vpermq m3, m3, 11011000b - - vextracti128 xm4, m3, 1 - movu [r2], xm3 - movq [r2 + 16], xm4 - add r2, r3 - add r0, r1 - dec r4d - jnz .loop - RET - - -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6 - mov r4d, r4m - -%ifdef PIC - lea r5, [h_tab_ChromaCoeff] - vpbroadcastd m0, [r5 + r4 * 4] -%else - vpbroadcastd m0, [h_tab_ChromaCoeff + r4 * 4] -%endif - - mova m4, [interp4_hpp_shuf] - mova m5, [pw_1] - dec r0 - lea r4, [r1 * 3] - movq xm1, [r0] - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m1, m1, xm2, 1 - lea r0, [r0 + r1 * 4] - movq xm3, [r0] - movhps xm3, [r0 + r1] - movq xm2, [r0 + r1 * 2] - movhps xm2, [r0 + r4] - vinserti128 m3, m3, xm2, 1 - - pshufb m1, m4 - pshufb m3, m4 - pmaddubsw m1, m0 - pmaddubsw m3, m0 - pmaddwd m1, m5 - pmaddwd m3, m5 - packssdw m1, m3 - pmulhrsw m1, [pw_512] - vextracti128 xm2, m1, 1 - packuswb xm1, xm2 - - lea r4, [r3 * 3] - pextrw [r2], xm1, 0 - pextrw [r2 + r3], xm1, 1 - pextrw [r2 + r3 * 2], xm1, 4 - pextrw [r2 + r4], xm1, 5 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm1, 2 - pextrw [r2 + r3], xm1, 3 - pextrw [r2 + r3 * 2], xm1, 6 - pextrw [r2 + r4], xm1, 7 - lea r2, [r2 + r3 * 4] - lea r0, [r0 + r1 * 4] - - lea r4, [r1 * 3] - movq xm1, [r0] - movhps xm1, [r0 + r1] - movq xm2, [r0 + r1 * 2] - 
    movhps xm2, [r0 + r4]
-    vinserti128 m1, m1, xm2, 1
-    lea r0, [r0 + r1 * 4]
-    movq xm3, [r0]
-    movhps xm3, [r0 + r1]
-    movq xm2, [r0 + r1 * 2]
-    movhps xm2, [r0 + r4]
-    vinserti128 m3, m3, xm2, 1
-
-    pshufb m1, m4
-    pshufb m3, m4
-    pmaddubsw m1, m0
-    pmaddubsw m3, m0
-    pmaddwd m1, m5
-    pmaddwd m3, m5
-    packssdw m1, m3
-    pmulhrsw m1, [pw_512]
-    vextracti128 xm2, m1, 1
-    packuswb xm1, xm2
-
-    lea r4, [r3 * 3]
-    pextrw [r2], xm1, 0
-    pextrw [r2 + r3], xm1, 1
-    pextrw [r2 + r3 * 2], xm1, 4
-    pextrw [r2 + r4], xm1, 5
-    lea r2, [r2 + r3 * 4]
-    pextrw [r2], xm1, 2
-    pextrw [r2 + r3], xm1, 3
-    pextrw [r2 + r3 * 2], xm1, 6
-    pextrw [r2 + r4], xm1, 7
-    RET
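For readers who do not follow the AVX2 code, here is a scalar sketch of what the 8-bit `interp_4tap_horiz_pp_*` and `interp_4tap_horiz_ps_*` kernels above appear to compute. It is not taken from x265. The names `chroma_coeff`, `filter4`, `interp_4tap_horiz_pp_ref` and `interp_4tap_horiz_ps_ref`, and the explicit `width`/`height` parameters, are hypothetical and used only for illustration. The sketch assumes 8-bit pixels, strides counted in elements, and coefficients equal to the `tab_ChromaCoeff` values listed in the `h4-ipfilter16.asm` diff. It reads `pmulhrsw` with `pw_512` as `(x + 32) >> 6` and `psubw` with `pw_2000` as subtracting 8192.

```c
#include <stddef.h>
#include <stdint.h>

/* 4-tap chroma coefficients, one row per fractional position
 * (same values as tab_ChromaCoeff in the diff; the name is hypothetical). */
static const int8_t chroma_coeff[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

/* Weighted sum over src[-1..2]; the asm gets the -1 offset from "dec r0". */
static int filter4(const uint8_t *src, const int8_t *c)
{
    return c[0] * src[-1] + c[1] * src[0] + c[2] * src[1] + c[3] * src[2];
}

/* pixel -> pixel: round, shift right by 6, clamp to 0..255.
 * Assumes >> on a negative int is an arithmetic shift, as pmulhrsw behaves. */
void interp_4tap_horiz_pp_ref(const uint8_t *src, ptrdiff_t srcStride,
                              uint8_t *dst, ptrdiff_t dstStride,
                              int width, int height, int coeffIdx)
{
    const int8_t *c = chroma_coeff[coeffIdx];
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = (filter4(src + x, c) + 32) >> 6;
            dst[x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
        src += srcStride;
        dst += dstStride;
    }
}

/* pixel -> int16 intermediate: no shift at 8 bits, only the -8192 offset.
 * With isRowExt set, one extra row above and two below are filtered too,
 * matching "sub r0, r1" and "add r6d, 3" in the 24x64 kernel. */
void interp_4tap_horiz_ps_ref(const uint8_t *src, ptrdiff_t srcStride,
                              int16_t *dst, ptrdiff_t dstStride,
                              int width, int height, int coeffIdx, int isRowExt)
{
    const int8_t *c = chroma_coeff[coeffIdx];
    if (isRowExt) {
        src -= srcStride;
        height += 3;
    }
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)(filter4(src + x, c) - 8192);
        src += srcStride;
        dst += dstStride;
    }
}
```

A scalar version like this is mainly useful as a baseline for comparing SIMD output block by block. The caller must make sure one pixel to the left and two to the right of every row are readable.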
x265_2.7.tar.gz/source/common/x86/h4-ipfilter16.asm
Deleted
@@ -1,2632 +0,0 @@ -;***************************************************************************** -;* Copyright (C) 2013-2017 MulticoreWare, Inc -;* -;* Authors: Nabajit Deka <nabajit@multicorewareinc.com> -;* Murugan Vairavel <murugan@multicorewareinc.com> -;* Min Chen <chenm003@163.com> -;* -;* This program is free software; you can redistribute it and/or modify -;* it under the terms of the GNU General Public License as published by -;* the Free Software Foundation; either version 2 of the License, or -;* (at your option) any later version. -;* -;* This program is distributed in the hope that it will be useful, -;* but WITHOUT ANY WARRANTY; without even the implied warranty of -;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -;* GNU General Public License for more details. -;* -;* You should have received a copy of the GNU General Public License -;* along with this program; if not, write to the Free Software -;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. -;* -;* This program is also available under a commercial proprietary license. -;* For more information, contact us at license @ x265.com. -;*****************************************************************************/ - -%include "x86inc.asm" -%include "x86util.asm" - - -%define INTERP_OFFSET_PP pd_32 -%define INTERP_SHIFT_PP 6 - -%if BIT_DEPTH == 10 - %define INTERP_SHIFT_PS 2 - %define INTERP_OFFSET_PS pd_n32768 - %define INTERP_SHIFT_SP 10 - %define INTERP_OFFSET_SP h4_pd_524800 -%elif BIT_DEPTH == 12 - %define INTERP_SHIFT_PS 4 - %define INTERP_OFFSET_PS pd_n131072 - %define INTERP_SHIFT_SP 8 - %define INTERP_OFFSET_SP pd_524416 -%else - %error Unsupport bit depth! 
-%endif - - -SECTION_RODATA 32 - -tab_c_32: times 8 dd 32 -h4_pd_524800: times 8 dd 524800 - -tab_Tm16: db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 - -tab_ChromaCoeff: dw 0, 64, 0, 0 - dw -2, 58, 10, -2 - dw -4, 54, 16, -2 - dw -6, 46, 28, -4 - dw -4, 36, 36, -4 - dw -4, 28, 46, -6 - dw -2, 16, 54, -4 - dw -2, 10, 58, -2 - -const h4_interp8_hpp_shuf, db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 - db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 - -SECTION .text -cextern pd_8 -cextern pd_32 -cextern pw_pixel_max -cextern pd_524416 -cextern pd_n32768 -cextern pd_n131072 -cextern pw_2000 -cextern idct8_shuf2 - -%macro FILTERH_W2_4_sse3 2 - movh m3, [r0 + %1] - movhps m3, [r0 + %1 + 2] - pmaddwd m3, m0 - movh m4, [r0 + r1 + %1] - movhps m4, [r0 + r1 + %1 + 2] - pmaddwd m4, m0 - pshufd m2, m3, q2301 - paddd m3, m2 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m3, m3, q3120 - pshufd m4, m4, q3120 - punpcklqdq m3, m4 - paddd m3, m1 - movh m5, [r0 + 2 * r1 + %1] - movhps m5, [r0 + 2 * r1 + %1 + 2] - pmaddwd m5, m0 - movh m4, [r0 + r4 + %1] - movhps m4, [r0 + r4 + %1 + 2] - pmaddwd m4, m0 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m5, m5, q3120 - pshufd m4, m4, q3120 - punpcklqdq m5, m4 - paddd m5, m1 -%ifidn %2, pp - psrad m3, 6 - psrad m5, 6 - packssdw m3, m5 - CLIPW m3, m7, m6 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movd [r2 + %1], m3 - psrldq m3, 4 - movd [r2 + r3 + %1], m3 - psrldq m3, 4 - movd [r2 + r3 * 2 + %1], m3 - psrldq m3, 4 - movd [r2 + r5 + %1], m3 -%endmacro - -%macro FILTERH_W2_3_sse3 1 - movh m3, [r0 + %1] - movhps m3, [r0 + %1 + 2] - pmaddwd m3, m0 - movh m4, [r0 + r1 + %1] - movhps m4, [r0 + r1 + %1 + 2] - pmaddwd m4, m0 - pshufd m2, m3, q2301 - paddd m3, m2 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m3, m3, q3120 - pshufd m4, m4, q3120 - punpcklqdq m3, m4 - paddd m3, m1 - - movh m5, [r0 + 2 * r1 + %1] - movhps m5, [r0 + 2 * r1 + %1 + 2] - 
pmaddwd m5, m0 - - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m5, m5, q3120 - paddd m5, m1 - - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 - - movd [r2 + %1], m3 - psrldq m3, 4 - movd [r2 + r3 + %1], m3 - psrldq m3, 4 - movd [r2 + r3 * 2 + %1], m3 -%endmacro - -%macro FILTERH_W4_2_sse3 2 - movh m3, [r0 + %1] - movhps m3, [r0 + %1 + 2] - pmaddwd m3, m0 - movh m4, [r0 + %1 + 4] - movhps m4, [r0 + %1 + 6] - pmaddwd m4, m0 - pshufd m2, m3, q2301 - paddd m3, m2 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m3, m3, q3120 - pshufd m4, m4, q3120 - punpcklqdq m3, m4 - paddd m3, m1 - - movh m5, [r0 + r1 + %1] - movhps m5, [r0 + r1 + %1 + 2] - pmaddwd m5, m0 - movh m4, [r0 + r1 + %1 + 4] - movhps m4, [r0 + r1 + %1 + 6] - pmaddwd m4, m0 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m5, m5, q3120 - pshufd m4, m4, q3120 - punpcklqdq m5, m4 - paddd m5, m1 -%ifidn %2, pp - psrad m3, 6 - psrad m5, 6 - packssdw m3, m5 - CLIPW m3, m7, m6 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2 + %1], m3 - movhps [r2 + r3 + %1], m3 -%endmacro - -%macro FILTERH_W4_1_sse3 1 - movh m3, [r0 + 2 * r1 + %1] - movhps m3, [r0 + 2 * r1 + %1 + 2] - pmaddwd m3, m0 - movh m4, [r0 + 2 * r1 + %1 + 4] - movhps m4, [r0 + 2 * r1 + %1 + 6] - pmaddwd m4, m0 - pshufd m2, m3, q2301 - paddd m3, m2 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m3, m3, q3120 - pshufd m4, m4, q3120 - punpcklqdq m3, m4 - paddd m3, m1 - - psrad m3, INTERP_SHIFT_PS - packssdw m3, m3 - movh [r2 + r3 * 2 + %1], m3 -%endmacro - -%macro FILTERH_W8_1_sse3 2 - movh m3, [r0 + %1] - movhps m3, [r0 + %1 + 2] - pmaddwd m3, m0 - movh m4, [r0 + %1 + 4] - movhps m4, [r0 + %1 + 6] - pmaddwd m4, m0 - pshufd m2, m3, q2301 - paddd m3, m2 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m3, m3, q3120 - pshufd m4, m4, q3120 - punpcklqdq m3, m4 - paddd m3, m1 - - movh m5, [r0 + %1 + 8] - movhps m5, [r0 + %1 + 10] - pmaddwd m5, m0 - movh 
m4, [r0 + %1 + 12] - movhps m4, [r0 + %1 + 14] - pmaddwd m4, m0 - pshufd m2, m5, q2301 - paddd m5, m2 - pshufd m2, m4, q2301 - paddd m4, m2 - pshufd m5, m5, q3120 - pshufd m4, m4, q3120 - punpcklqdq m5, m4 - paddd m5, m1 -%ifidn %2, pp - psrad m3, 6 - psrad m5, 6 - packssdw m3, m5 - CLIPW m3, m7, m6 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movdqu [r2 + %1], m3 -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_HOR_CHROMA_sse3 3 -INIT_XMM sse3 -cglobal interp_4tap_horiz_%3_%1x%2, 4, 7, 8 - add r3, r3 - add r1, r1 - sub r0, 2 - mov r4d, r4m - add r4d, r4d - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - movddup m0, [r6 + r4 * 4] -%else - movddup m0, [tab_ChromaCoeff + r4 * 4] -%endif - -%ifidn %3, ps - mova m1, [INTERP_OFFSET_PS] - cmp r5m, byte 0 -%if %1 <= 6 - lea r4, [r1 * 3] - lea r5, [r3 * 3] -%endif - je .skip - sub r0, r1 -%if %1 <= 6 -%assign y 1 -%else -%assign y 3 -%endif -%assign z 0 -%rep y -%assign x 0 -%rep %1/8 - FILTERH_W8_1_sse3 x, %3 -%assign x x+16 -%endrep -%if %1 == 4 || (%1 == 6 && z == 0) || (%1 == 12 && z == 0) - FILTERH_W4_2_sse3 x, %3 - FILTERH_W4_1_sse3 x -%assign x x+8 -%endif -%if %1 == 2 || (%1 == 6 && z == 0) - FILTERH_W2_3_sse3 x -%endif -%if %1 <= 6 - lea r0, [r0 + r4] - lea r2, [r2 + r5] -%else - lea r0, [r0 + r1] - lea r2, [r2 + r3] -%endif -%assign z z+1 -%endrep -.skip: -%elifidn %3, pp - pxor m7, m7 - mova m6, [pw_pixel_max] - mova m1, [tab_c_32] -%if %1 == 2 || %1 == 6 - lea r4, [r1 * 3] - lea r5, [r3 * 3] -%endif -%endif - -%if %1 == 2 -%assign y %2/4 -%elif %1 <= 6 -%assign y %2/2 -%else -%assign y %2 -%endif -%assign z 0 -%rep y -%assign x 0 -%rep %1/8 - FILTERH_W8_1_sse3 x, %3 -%assign x x+16 -%endrep -%if %1 == 4 || %1 == 6 || (%1 == 
12 && (z % 2) == 0) - FILTERH_W4_2_sse3 x, %3 -%assign x x+8 -%endif -%if %1 == 2 || (%1 == 6 && (z % 2) == 0) - FILTERH_W2_4_sse3 x, %3 -%endif -%assign z z+1 -%if z < y -%if %1 == 2 - lea r0, [r0 + 4 * r1] - lea r2, [r2 + 4 * r3] -%elif %1 <= 6 - lea r0, [r0 + 2 * r1] - lea r2, [r2 + 2 * r3] -%else - lea r0, [r0 + r1] - lea r2, [r2 + r3] -%endif -%endif ;z < y -%endrep - - RET -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- - -FILTER_HOR_CHROMA_sse3 2, 4, pp -FILTER_HOR_CHROMA_sse3 2, 8, pp -FILTER_HOR_CHROMA_sse3 2, 16, pp -FILTER_HOR_CHROMA_sse3 4, 2, pp -FILTER_HOR_CHROMA_sse3 4, 4, pp -FILTER_HOR_CHROMA_sse3 4, 8, pp -FILTER_HOR_CHROMA_sse3 4, 16, pp -FILTER_HOR_CHROMA_sse3 4, 32, pp -FILTER_HOR_CHROMA_sse3 6, 8, pp -FILTER_HOR_CHROMA_sse3 6, 16, pp -FILTER_HOR_CHROMA_sse3 8, 2, pp -FILTER_HOR_CHROMA_sse3 8, 4, pp -FILTER_HOR_CHROMA_sse3 8, 6, pp -FILTER_HOR_CHROMA_sse3 8, 8, pp -FILTER_HOR_CHROMA_sse3 8, 12, pp -FILTER_HOR_CHROMA_sse3 8, 16, pp -FILTER_HOR_CHROMA_sse3 8, 32, pp -FILTER_HOR_CHROMA_sse3 8, 64, pp -FILTER_HOR_CHROMA_sse3 12, 16, pp -FILTER_HOR_CHROMA_sse3 12, 32, pp -FILTER_HOR_CHROMA_sse3 16, 4, pp -FILTER_HOR_CHROMA_sse3 16, 8, pp -FILTER_HOR_CHROMA_sse3 16, 12, pp -FILTER_HOR_CHROMA_sse3 16, 16, pp -FILTER_HOR_CHROMA_sse3 16, 24, pp -FILTER_HOR_CHROMA_sse3 16, 32, pp -FILTER_HOR_CHROMA_sse3 16, 64, pp -FILTER_HOR_CHROMA_sse3 24, 32, pp -FILTER_HOR_CHROMA_sse3 24, 64, pp -FILTER_HOR_CHROMA_sse3 32, 8, pp -FILTER_HOR_CHROMA_sse3 32, 16, pp -FILTER_HOR_CHROMA_sse3 32, 24, pp -FILTER_HOR_CHROMA_sse3 32, 32, pp -FILTER_HOR_CHROMA_sse3 32, 48, pp -FILTER_HOR_CHROMA_sse3 32, 64, pp -FILTER_HOR_CHROMA_sse3 48, 64, pp -FILTER_HOR_CHROMA_sse3 64, 16, pp -FILTER_HOR_CHROMA_sse3 64, 32, pp -FILTER_HOR_CHROMA_sse3 64, 
48, pp -FILTER_HOR_CHROMA_sse3 64, 64, pp - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- - -FILTER_HOR_CHROMA_sse3 2, 4, ps -FILTER_HOR_CHROMA_sse3 2, 8, ps -FILTER_HOR_CHROMA_sse3 2, 16, ps -FILTER_HOR_CHROMA_sse3 4, 2, ps -FILTER_HOR_CHROMA_sse3 4, 4, ps -FILTER_HOR_CHROMA_sse3 4, 8, ps -FILTER_HOR_CHROMA_sse3 4, 16, ps -FILTER_HOR_CHROMA_sse3 4, 32, ps -FILTER_HOR_CHROMA_sse3 6, 8, ps -FILTER_HOR_CHROMA_sse3 6, 16, ps -FILTER_HOR_CHROMA_sse3 8, 2, ps -FILTER_HOR_CHROMA_sse3 8, 4, ps -FILTER_HOR_CHROMA_sse3 8, 6, ps -FILTER_HOR_CHROMA_sse3 8, 8, ps -FILTER_HOR_CHROMA_sse3 8, 12, ps -FILTER_HOR_CHROMA_sse3 8, 16, ps -FILTER_HOR_CHROMA_sse3 8, 32, ps -FILTER_HOR_CHROMA_sse3 8, 64, ps -FILTER_HOR_CHROMA_sse3 12, 16, ps -FILTER_HOR_CHROMA_sse3 12, 32, ps -FILTER_HOR_CHROMA_sse3 16, 4, ps -FILTER_HOR_CHROMA_sse3 16, 8, ps -FILTER_HOR_CHROMA_sse3 16, 12, ps -FILTER_HOR_CHROMA_sse3 16, 16, ps -FILTER_HOR_CHROMA_sse3 16, 24, ps -FILTER_HOR_CHROMA_sse3 16, 32, ps -FILTER_HOR_CHROMA_sse3 16, 64, ps -FILTER_HOR_CHROMA_sse3 24, 32, ps -FILTER_HOR_CHROMA_sse3 24, 64, ps -FILTER_HOR_CHROMA_sse3 32, 8, ps -FILTER_HOR_CHROMA_sse3 32, 16, ps -FILTER_HOR_CHROMA_sse3 32, 24, ps -FILTER_HOR_CHROMA_sse3 32, 32, ps -FILTER_HOR_CHROMA_sse3 32, 48, ps -FILTER_HOR_CHROMA_sse3 32, 64, ps -FILTER_HOR_CHROMA_sse3 48, 64, ps -FILTER_HOR_CHROMA_sse3 64, 16, ps -FILTER_HOR_CHROMA_sse3 64, 32, ps -FILTER_HOR_CHROMA_sse3 64, 48, ps -FILTER_HOR_CHROMA_sse3 64, 64, ps - -%macro FILTER_W2_2 1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + r1] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - packusdw m3, m3 - CLIPW m3, m7, m6 -%else - psrad m3, INTERP_SHIFT_PS - packssdw m3, m3 -%endif - 
movd [r2], m3 - pextrd [r2 + r3], m3, 1 -%endmacro - -%macro FILTER_W4_2 1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + r1] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + r1 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m7, m6 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2], m3 - movhps [r2 + r3], m3 -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_CHROMA_H 6 -INIT_XMM sse4 -cglobal interp_4tap_horiz_%3_%1x%2, 4, %4, %5 - - add r3, r3 - add r1, r1 - sub r0, 2 - mov r4d, r4m - add r4d, r4d - -%ifdef PIC - lea r%6, [tab_ChromaCoeff] - movh m0, [r%6 + r4 * 4] -%else - movh m0, [tab_ChromaCoeff + r4 * 4] -%endif - - punpcklqdq m0, m0 - mova m2, [tab_Tm16] - -%ifidn %3, ps - mova m1, [INTERP_OFFSET_PS] - cmp r5m, byte 0 - je .skip - sub r0, r1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - - %if %1 == 4 - movu m4, [r0 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - %else - phaddd m3, m3 - %endif - - paddd m3, m1 - psrad m3, INTERP_SHIFT_PS - packssdw m3, m3 - - %if %1 == 2 - movd [r2], m3 - %else - movh [r2], m3 - %endif - - add r0, r1 - add r2, r3 - FILTER_W%1_2 %3 - lea r0, [r0 + 2 * r1] - lea r2, [r2 + 2 * r3] - -.skip: - -%else ;%ifidn %3, ps - pxor m7, m7 - mova m6, [pw_pixel_max] - mova m1, [tab_c_32] -%endif ;%ifidn %3, ps - - FILTER_W%1_2 %3 - -%rep (%2/2) - 1 - lea r0, [r0 + 2 * r1] - lea r2, [r2 + 2 * r3] - FILTER_W%1_2 %3 -%endrep - RET -%endmacro - -FILTER_CHROMA_H 2, 4, pp, 6, 8, 5 
-FILTER_CHROMA_H 2, 8, pp, 6, 8, 5 -FILTER_CHROMA_H 4, 2, pp, 6, 8, 5 -FILTER_CHROMA_H 4, 4, pp, 6, 8, 5 -FILTER_CHROMA_H 4, 8, pp, 6, 8, 5 -FILTER_CHROMA_H 4, 16, pp, 6, 8, 5 - -FILTER_CHROMA_H 2, 4, ps, 7, 5, 6 -FILTER_CHROMA_H 2, 8, ps, 7, 5, 6 -FILTER_CHROMA_H 4, 2, ps, 7, 6, 6 -FILTER_CHROMA_H 4, 4, ps, 7, 6, 6 -FILTER_CHROMA_H 4, 8, ps, 7, 6, 6 -FILTER_CHROMA_H 4, 16, ps, 7, 6, 6 - -FILTER_CHROMA_H 2, 16, pp, 6, 8, 5 -FILTER_CHROMA_H 4, 32, pp, 6, 8, 5 -FILTER_CHROMA_H 2, 16, ps, 7, 5, 6 -FILTER_CHROMA_H 4, 32, ps, 7, 6, 6 - - -%macro FILTER_W6_1 1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m4, [r0 + 8] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m4, m4 - paddd m4, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m4, INTERP_SHIFT_PP - packusdw m3, m4 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m4, INTERP_SHIFT_PS - packssdw m3, m4 -%endif - movh [r2], m3 - pextrd [r2 + 8], m3, 2 -%endmacro - -cglobal chroma_filter_pp_6x1_internal - FILTER_W6_1 pp - ret - -cglobal chroma_filter_ps_6x1_internal - FILTER_W6_1 ps - ret - -%macro FILTER_W8_1 1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 8] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 12] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2], m3 - movhps [r2 + 8], m3 -%endmacro - -cglobal chroma_filter_pp_8x1_internal - FILTER_W8_1 pp - ret - -cglobal chroma_filter_ps_8x1_internal - FILTER_W8_1 ps - ret - -%macro FILTER_W12_1 1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd 
m3, m4 - paddd m3, m1 - - movu m5, [r0 + 8] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 12] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2], m3 - movhps [r2 + 8], m3 - - movu m3, [r0 + 16] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 20] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - packusdw m3, m3 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - packssdw m3, m3 -%endif - movh [r2 + 16], m3 -%endmacro - -cglobal chroma_filter_pp_12x1_internal - FILTER_W12_1 pp - ret - -cglobal chroma_filter_ps_12x1_internal - FILTER_W12_1 ps - ret - -%macro FILTER_W16_1 1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 8] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 12] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2], m3 - movhps [r2 + 8], m3 - - movu m3, [r0 + 16] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 20] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 24] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 28] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2 + 16], m3 - movhps [r2 + 24], m3 -%endmacro - -cglobal 
chroma_filter_pp_16x1_internal - FILTER_W16_1 pp - ret - -cglobal chroma_filter_ps_16x1_internal - FILTER_W16_1 ps - ret - -%macro FILTER_W24_1 1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 8] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 12] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2], m3 - movhps [r2 + 8], m3 - - movu m3, [r0 + 16] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 20] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 24] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 28] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2 + 16], m3 - movhps [r2 + 24], m3 - - movu m3, [r0 + 32] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 36] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 40] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 44] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2 + 32], m3 - movhps [r2 + 40], m3 -%endmacro - -cglobal chroma_filter_pp_24x1_internal - FILTER_W24_1 pp - ret - -cglobal chroma_filter_ps_24x1_internal - FILTER_W24_1 ps - ret - -%macro FILTER_W32_1 1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 4] - pshufb 
m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 8] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 12] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2], m3 - movhps [r2 + 8], m3 - - movu m3, [r0 + 16] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 20] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 24] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 28] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2 + 16], m3 - movhps [r2 + 24], m3 - - movu m3, [r0 + 32] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 36] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 40] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 44] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2 + 32], m3 - movhps [r2 + 40], m3 - - movu m3, [r0 + 48] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + 52] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + 56] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + 60] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw 
m3, m5 -%endif - movh [r2 + 48], m3 - movhps [r2 + 56], m3 -%endmacro - -cglobal chroma_filter_pp_32x1_internal - FILTER_W32_1 pp - ret - -cglobal chroma_filter_ps_32x1_internal - FILTER_W32_1 ps - ret - -%macro FILTER_W8o_1 2 - movu m3, [r0 + %2] - pshufb m3, m3, m2 - pmaddwd m3, m0 - movu m4, [r0 + %2 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m1 - - movu m5, [r0 + %2 + 8] - pshufb m5, m5, m2 - pmaddwd m5, m0 - movu m4, [r0 + %2 + 12] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m5, m4 - paddd m5, m1 -%ifidn %1, pp - psrad m3, INTERP_SHIFT_PP - psrad m5, INTERP_SHIFT_PP - packusdw m3, m5 - CLIPW m3, m6, m7 -%else - psrad m3, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS - packssdw m3, m5 -%endif - movh [r2 + %2], m3 - movhps [r2 + %2 + 8], m3 -%endmacro - -%macro FILTER_W48_1 1 - FILTER_W8o_1 %1, 0 - FILTER_W8o_1 %1, 16 - FILTER_W8o_1 %1, 32 - FILTER_W8o_1 %1, 48 - FILTER_W8o_1 %1, 64 - FILTER_W8o_1 %1, 80 -%endmacro - -cglobal chroma_filter_pp_48x1_internal - FILTER_W48_1 pp - ret - -cglobal chroma_filter_ps_48x1_internal - FILTER_W48_1 ps - ret - -%macro FILTER_W64_1 1 - FILTER_W8o_1 %1, 0 - FILTER_W8o_1 %1, 16 - FILTER_W8o_1 %1, 32 - FILTER_W8o_1 %1, 48 - FILTER_W8o_1 %1, 64 - FILTER_W8o_1 %1, 80 - FILTER_W8o_1 %1, 96 - FILTER_W8o_1 %1, 112 -%endmacro - -cglobal chroma_filter_pp_64x1_internal - FILTER_W64_1 pp - ret - -cglobal chroma_filter_ps_64x1_internal - FILTER_W64_1 ps - ret -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- - -INIT_XMM sse4 -%macro IPFILTER_CHROMA 6 -cglobal interp_4tap_horiz_%3_%1x%2, 4, %5, %6 - - add r3, r3 - add r1, r1 - sub r0, 2 - mov r4d, r4m - add r4d, r4d - -%ifdef PIC - lea r%4, [tab_ChromaCoeff] - movh m0, [r%4 + r4 * 4] -%else - movh m0, [tab_ChromaCoeff + r4 * 4] -%endif - - 
punpcklqdq m0, m0 - mova m2, [tab_Tm16] - -%ifidn %3, ps - mova m1, [INTERP_OFFSET_PS] - cmp r5m, byte 0 - je .skip - sub r0, r1 - call chroma_filter_%3_%1x1_internal - add r0, r1 - add r2, r3 - call chroma_filter_%3_%1x1_internal - add r0, r1 - add r2, r3 - call chroma_filter_%3_%1x1_internal - add r0, r1 - add r2, r3 -.skip: -%else - mova m1, [tab_c_32] - pxor m6, m6 - mova m7, [pw_pixel_max] -%endif - - call chroma_filter_%3_%1x1_internal -%rep %2 - 1 - add r0, r1 - add r2, r3 - call chroma_filter_%3_%1x1_internal -%endrep -RET -%endmacro -IPFILTER_CHROMA 6, 8, pp, 5, 6, 8 -IPFILTER_CHROMA 8, 2, pp, 5, 6, 8 -IPFILTER_CHROMA 8, 4, pp, 5, 6, 8 -IPFILTER_CHROMA 8, 6, pp, 5, 6, 8 -IPFILTER_CHROMA 8, 8, pp, 5, 6, 8 -IPFILTER_CHROMA 8, 16, pp, 5, 6, 8 -IPFILTER_CHROMA 8, 32, pp, 5, 6, 8 -IPFILTER_CHROMA 12, 16, pp, 5, 6, 8 -IPFILTER_CHROMA 16, 4, pp, 5, 6, 8 -IPFILTER_CHROMA 16, 8, pp, 5, 6, 8 -IPFILTER_CHROMA 16, 12, pp, 5, 6, 8 -IPFILTER_CHROMA 16, 16, pp, 5, 6, 8 -IPFILTER_CHROMA 16, 32, pp, 5, 6, 8 -IPFILTER_CHROMA 24, 32, pp, 5, 6, 8 -IPFILTER_CHROMA 32, 8, pp, 5, 6, 8 -IPFILTER_CHROMA 32, 16, pp, 5, 6, 8 -IPFILTER_CHROMA 32, 24, pp, 5, 6, 8 -IPFILTER_CHROMA 32, 32, pp, 5, 6, 8 - -IPFILTER_CHROMA 6, 8, ps, 6, 7, 6 -IPFILTER_CHROMA 8, 2, ps, 6, 7, 6 -IPFILTER_CHROMA 8, 4, ps, 6, 7, 6 -IPFILTER_CHROMA 8, 6, ps, 6, 7, 6 -IPFILTER_CHROMA 8, 8, ps, 6, 7, 6 -IPFILTER_CHROMA 8, 16, ps, 6, 7, 6 -IPFILTER_CHROMA 8, 32, ps, 6, 7, 6 -IPFILTER_CHROMA 12, 16, ps, 6, 7, 6 -IPFILTER_CHROMA 16, 4, ps, 6, 7, 6 -IPFILTER_CHROMA 16, 8, ps, 6, 7, 6 -IPFILTER_CHROMA 16, 12, ps, 6, 7, 6 -IPFILTER_CHROMA 16, 16, ps, 6, 7, 6 -IPFILTER_CHROMA 16, 32, ps, 6, 7, 6 -IPFILTER_CHROMA 24, 32, ps, 6, 7, 6 -IPFILTER_CHROMA 32, 8, ps, 6, 7, 6 -IPFILTER_CHROMA 32, 16, ps, 6, 7, 6 -IPFILTER_CHROMA 32, 24, ps, 6, 7, 6 -IPFILTER_CHROMA 32, 32, ps, 6, 7, 6 - -IPFILTER_CHROMA 6, 16, pp, 5, 6, 8 -IPFILTER_CHROMA 8, 12, pp, 5, 6, 8 -IPFILTER_CHROMA 8, 64, pp, 5, 6, 8 -IPFILTER_CHROMA 12, 32, pp, 5, 6, 8 
-IPFILTER_CHROMA 16, 24, pp, 5, 6, 8 -IPFILTER_CHROMA 16, 64, pp, 5, 6, 8 -IPFILTER_CHROMA 24, 64, pp, 5, 6, 8 -IPFILTER_CHROMA 32, 48, pp, 5, 6, 8 -IPFILTER_CHROMA 32, 64, pp, 5, 6, 8 -IPFILTER_CHROMA 6, 16, ps, 6, 7, 6 -IPFILTER_CHROMA 8, 12, ps, 6, 7, 6 -IPFILTER_CHROMA 8, 64, ps, 6, 7, 6 -IPFILTER_CHROMA 12, 32, ps, 6, 7, 6 -IPFILTER_CHROMA 16, 24, ps, 6, 7, 6 -IPFILTER_CHROMA 16, 64, ps, 6, 7, 6 -IPFILTER_CHROMA 24, 64, ps, 6, 7, 6 -IPFILTER_CHROMA 32, 48, ps, 6, 7, 6 -IPFILTER_CHROMA 32, 64, ps, 6, 7, 6 - -IPFILTER_CHROMA 48, 64, pp, 5, 6, 8 -IPFILTER_CHROMA 64, 48, pp, 5, 6, 8 -IPFILTER_CHROMA 64, 64, pp, 5, 6, 8 -IPFILTER_CHROMA 64, 32, pp, 5, 6, 8 -IPFILTER_CHROMA 64, 16, pp, 5, 6, 8 -IPFILTER_CHROMA 48, 64, ps, 6, 7, 6 -IPFILTER_CHROMA 64, 48, ps, 6, 7, 6 -IPFILTER_CHROMA 64, 64, ps, 6, 7, 6 -IPFILTER_CHROMA 64, 32, ps, 6, 7, 6 -IPFILTER_CHROMA 64, 16, ps, 6, 7, 6 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -%macro IPFILTER_CHROMA_avx2_6xN 1 -cglobal interp_4tap_horiz_pp_6x%1, 5,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - mov r4d, %1/2 -.loop: - vbroadcasti128 m3, [r0] - vbroadcasti128 m4, [r0 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3, q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movq [r2], xm3 - 
pextrd [r2 + 8], xm3, 2 - - vbroadcasti128 m3, [r0 + r1] - vbroadcasti128 m4, [r0 + r1 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3, q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movq [r2 + r3], xm3 - pextrd [r2 + r3 + 8], xm3, 2 - - lea r0, [r0 + r1 * 2] - lea r2, [r2 + r3 * 2] - dec r4d - jnz .loop - RET -%endmacro -IPFILTER_CHROMA_avx2_6xN 8 -IPFILTER_CHROMA_avx2_6xN 16 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_8x2, 5,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - vbroadcasti128 m3, [r0] - vbroadcasti128 m4, [r0 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3,q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movu [r2], xm3 - - vbroadcasti128 m3, [r0 + r1] - vbroadcasti128 m4, [r0 + r1 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3,q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movu [r2 + r3], xm3 - RET - 
-;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal interp_4tap_horiz_pp_8x4, 5,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - -%rep 2 - vbroadcasti128 m3, [r0] - vbroadcasti128 m4, [r0 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3,q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movu [r2], xm3 - - vbroadcasti128 m3, [r0 + r1] - vbroadcasti128 m4, [r0 + r1 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3,q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movu [r2 + r3], xm3 - - lea r0, [r0 + r1 * 2] - lea r2, [r2 + r3 * 2] -%endrep - RET - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -%macro IPFILTER_CHROMA_avx2_8xN 1 -cglobal interp_4tap_horiz_pp_8x%1, 5,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + 
r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - mov r4d, %1/2 -.loop: - vbroadcasti128 m3, [r0] - vbroadcasti128 m4, [r0 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3, q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movu [r2], xm3 - - vbroadcasti128 m3, [r0 + r1] - vbroadcasti128 m4, [r0 + r1 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3, q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movu [r2 + r3], xm3 - - lea r0, [r0 + r1 * 2] - lea r2, [r2 + r3 * 2] - dec r4d - jnz .loop - RET -%endmacro -IPFILTER_CHROMA_avx2_8xN 6 -IPFILTER_CHROMA_avx2_8xN 8 -IPFILTER_CHROMA_avx2_8xN 12 -IPFILTER_CHROMA_avx2_8xN 16 -IPFILTER_CHROMA_avx2_8xN 32 -IPFILTER_CHROMA_avx2_8xN 64 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -%macro IPFILTER_CHROMA_avx2_16xN 1 -%if ARCH_X86_64 -cglobal interp_4tap_horiz_pp_16x%1, 5,6,9 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - mov r4d, %1 -.loop: - vbroadcasti128 m3, [r0] - vbroadcasti128 m4, [r0 + 8] - - 
pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3, q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m8, [r0 + 24] - - pshufb m4, m1 - pshufb m8, m1 - - pmaddwd m4, m0 - pmaddwd m8, m0 - phaddd m4, m8 - paddd m4, m2 - psrad m4, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m4, m4 - vpermq m4, m4, q2020 - pshufb xm4, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - vinserti128 m3, m3, xm4, 1 - CLIPW m3, m5, m7 - movu [r2], m3 - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET -%endif -%endmacro -IPFILTER_CHROMA_avx2_16xN 4 -IPFILTER_CHROMA_avx2_16xN 8 -IPFILTER_CHROMA_avx2_16xN 12 -IPFILTER_CHROMA_avx2_16xN 16 -IPFILTER_CHROMA_avx2_16xN 24 -IPFILTER_CHROMA_avx2_16xN 32 -IPFILTER_CHROMA_avx2_16xN 64 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -%macro IPFILTER_CHROMA_avx2_32xN 1 -%if ARCH_X86_64 -cglobal interp_4tap_horiz_pp_32x%1, 5,6,9 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - mov r6d, %1 -.loop: -%assign x 0 -%rep 2 - vbroadcasti128 m3, [r0 + x] - vbroadcasti128 m4, [r0 + 8 + x] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3, q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - - vbroadcasti128 
m4, [r0 + 16 + x] - vbroadcasti128 m8, [r0 + 24 + x] - pshufb m4, m1 - pshufb m8, m1 - - pmaddwd m4, m0 - pmaddwd m8, m0 - phaddd m4, m8 - paddd m4, m2 - psrad m4, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m4, m4 - vpermq m4, m4, q2020 - pshufb xm4, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - vinserti128 m3, m3, xm4, 1 - CLIPW m3, m5, m7 - movu [r2 + x], m3 - %assign x x+32 - %endrep - - add r0, r1 - add r2, r3 - dec r6d - jnz .loop - RET -%endif -%endmacro -IPFILTER_CHROMA_avx2_32xN 8 -IPFILTER_CHROMA_avx2_32xN 16 -IPFILTER_CHROMA_avx2_32xN 24 -IPFILTER_CHROMA_avx2_32xN 32 -IPFILTER_CHROMA_avx2_32xN 48 -IPFILTER_CHROMA_avx2_32xN 64 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -%macro IPFILTER_CHROMA_avx2_12xN 1 -%if ARCH_X86_64 -cglobal interp_4tap_horiz_pp_12x%1, 5,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - mov r4d, %1 -.loop: - vbroadcasti128 m3, [r0] - vbroadcasti128 m4, [r0 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3, q2020 - pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movu [r2], xm3 - - vbroadcasti128 m3, [r0 + 16] - vbroadcasti128 m4, [r0 + 24] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] - - packusdw m3, m3 - vpermq m3, m3, q2020 - 
pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] - CLIPW xm3, xm5, xm7 - movq [r2 + 16], xm3 - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET -%endif -%endmacro -IPFILTER_CHROMA_avx2_12xN 16 -IPFILTER_CHROMA_avx2_12xN 32 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -%macro IPFILTER_CHROMA_avx2_24xN 1 -%if ARCH_X86_64 -cglobal interp_4tap_horiz_pp_24x%1, 5,6,9 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - mov r4d, %1 -.loop: - vbroadcasti128 m3, [r0] - vbroadcasti128 m4, [r0 + 8] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m8, [r0 + 24] - pshufb m4, m1 - pshufb m8, m1 - - pmaddwd m4, m0 - pmaddwd m8, m0 - phaddd m4, m8 - paddd m4, m2 - psrad m4, 6 - - packusdw m3, m4 - vpermq m3, m3, q3120 - pshufb m3, m6 - CLIPW m3, m5, m7 - movu [r2], m3 - - vbroadcasti128 m3, [r0 + 32] - vbroadcasti128 m4, [r0 + 40] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 - - packusdw m3, m3 - vpermq m3, m3, q2020 - pshufb xm3, xm6 - CLIPW xm3, xm5, xm7 - movu [r2 + 32], xm3 - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET -%endif -%endmacro -IPFILTER_CHROMA_avx2_24xN 32 -IPFILTER_CHROMA_avx2_24xN 64 - -;------------------------------------------------------------------------------------------------------------- -; void 
interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -%macro IPFILTER_CHROMA_avx2_64xN 1 -%if ARCH_X86_64 -cglobal interp_4tap_horiz_pp_64x%1, 5,6,9 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] - vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - mov r6d, %1 -.loop: -%assign x 0 -%rep 4 - vbroadcasti128 m3, [r0 + x] - vbroadcasti128 m4, [r0 + 8 + x] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 - - vbroadcasti128 m4, [r0 + 16 + x] - vbroadcasti128 m8, [r0 + 24 + x] - pshufb m4, m1 - pshufb m8, m1 - - pmaddwd m4, m0 - pmaddwd m8, m0 - phaddd m4, m8 - paddd m4, m2 - psrad m4, 6 - - packusdw m3, m4 - vpermq m3, m3, q3120 - pshufb m3, m6 - CLIPW m3, m5, m7 - movu [r2 + x], m3 - %assign x x+32 - %endrep - - add r0, r1 - add r2, r3 - dec r6d - jnz .loop - RET -%endif -%endmacro -IPFILTER_CHROMA_avx2_64xN 16 -IPFILTER_CHROMA_avx2_64xN 32 -IPFILTER_CHROMA_avx2_64xN 48 -IPFILTER_CHROMA_avx2_64xN 64 - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx -;------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 -%if ARCH_X86_64 -cglobal interp_4tap_horiz_pp_48x64, 5,6,9 - add r1d, r1d - add r3d, r3d - sub r0, 2 - mov r4d, r4m -%ifdef PIC - lea r5, [tab_ChromaCoeff] - vpbroadcastq m0, [r5 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m1, [h4_interp8_hpp_shuf] 
- vpbroadcastd m2, [pd_32] - pxor m5, m5 - mova m6, [idct8_shuf2] - mova m7, [pw_pixel_max] - - mov r4d, 64 -.loop: -%assign x 0 -%rep 3 - vbroadcasti128 m3, [r0 + x] - vbroadcasti128 m4, [r0 + 8 + x] - pshufb m3, m1 - pshufb m4, m1 - - pmaddwd m3, m0 - pmaddwd m4, m0 - phaddd m3, m4 - paddd m3, m2 - psrad m3, 6 - - vbroadcasti128 m4, [r0 + 16 + x] - vbroadcasti128 m8, [r0 + 24 + x] - pshufb m4, m1 - pshufb m8, m1 - - pmaddwd m4, m0 - pmaddwd m8, m0 - phaddd m4, m8 - paddd m4, m2 - psrad m4, 6 - - packusdw m3, m4 - vpermq m3, m3, q3120 - pshufb m3, m6 - CLIPW m3, m5, m7 - movu [r2 + x], m3 -%assign x x+32 -%endrep - - add r0, r1 - add r2, r3 - dec r4d - jnz .loop - RET -%endif - -%macro IPFILTER_CHROMA_PS_8xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_horiz_ps_8x%1, 4, 7, 6 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m3, [h4_interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 2 - test r5d, r5d - mov r4d, %1 - jz .loop0 - sub r0, r1 - add r4d, 3 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2], xm4 - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - - IPFILTER_CHROMA_PS_8xN_AVX2 4 - IPFILTER_CHROMA_PS_8xN_AVX2 8 - IPFILTER_CHROMA_PS_8xN_AVX2 16 - IPFILTER_CHROMA_PS_8xN_AVX2 32 - IPFILTER_CHROMA_PS_8xN_AVX2 6 - IPFILTER_CHROMA_PS_8xN_AVX2 2 - IPFILTER_CHROMA_PS_8xN_AVX2 12 - IPFILTER_CHROMA_PS_8xN_AVX2 64 - -%macro IPFILTER_CHROMA_PS_16xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_horiz_ps_16x%1, 4, 7, 6 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, 
r5m - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m3, [h4_interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 2 - test r5d, r5d - mov r4d, %1 - jz .loop0 - sub r0, r1 - add r4d, 3 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 16], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - -IPFILTER_CHROMA_PS_16xN_AVX2 16 -IPFILTER_CHROMA_PS_16xN_AVX2 8 -IPFILTER_CHROMA_PS_16xN_AVX2 32 -IPFILTER_CHROMA_PS_16xN_AVX2 12 -IPFILTER_CHROMA_PS_16xN_AVX2 4 -IPFILTER_CHROMA_PS_16xN_AVX2 64 -IPFILTER_CHROMA_PS_16xN_AVX2 24 - -%macro IPFILTER_CHROMA_PS_24xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_horiz_ps_24x%1, 4, 7, 6 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m3, [h4_interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 2 - test r5d, r5d - mov r4d, %1 - jz .loop0 - sub r0, r1 - add r4d, 3 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu 
[r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 16], xm4 - - vbroadcasti128 m4, [r0 + 32] - vbroadcasti128 m5, [r0 + 40] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 32], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - -IPFILTER_CHROMA_PS_24xN_AVX2 32 -IPFILTER_CHROMA_PS_24xN_AVX2 64 - -%macro IPFILTER_CHROMA_PS_12xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_horiz_ps_12x%1, 4, 7, 6 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m3, [h4_interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 2 - test r5d, r5d - mov r4d, %1 - jz .loop0 - sub r0, r1 - add r4d, 3 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - pshufb m4, m3 - pmaddwd m4, m0 - phaddd m4, m4 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movq [r2 + 16], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - -IPFILTER_CHROMA_PS_12xN_AVX2 16 -IPFILTER_CHROMA_PS_12xN_AVX2 32 - -%macro IPFILTER_CHROMA_PS_32xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_horiz_ps_32x%1, 4, 7, 6 - 
add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m3, [h4_interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 2 - test r5d, r5d - mov r4d, %1 - jz .loop0 - sub r0, r1 - add r4d, 3 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 16], xm4 - - vbroadcasti128 m4, [r0 + 32] - vbroadcasti128 m5, [r0 + 40] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 32], xm4 - - vbroadcasti128 m4, [r0 + 48] - vbroadcasti128 m5, [r0 + 56] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 48], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - -IPFILTER_CHROMA_PS_32xN_AVX2 32 -IPFILTER_CHROMA_PS_32xN_AVX2 16 -IPFILTER_CHROMA_PS_32xN_AVX2 24 -IPFILTER_CHROMA_PS_32xN_AVX2 8 -IPFILTER_CHROMA_PS_32xN_AVX2 64 -IPFILTER_CHROMA_PS_32xN_AVX2 48 - - -%macro IPFILTER_CHROMA_PS_64xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_horiz_ps_64x%1, 4, 7, 6 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - -%ifdef PIC - lea r6, 
[tab_ChromaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m3, [h4_interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 2 - test r5d, r5d - mov r4d, %1 - jz .loop0 - sub r0, r1 - add r4d, 3 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 16], xm4 - - vbroadcasti128 m4, [r0 + 32] - vbroadcasti128 m5, [r0 + 40] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 32], xm4 - - vbroadcasti128 m4, [r0 + 48] - vbroadcasti128 m5, [r0 + 56] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 48], xm4 - - vbroadcasti128 m4, [r0 + 64] - vbroadcasti128 m5, [r0 + 72] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 64], xm4 - - vbroadcasti128 m4, [r0 + 80] - vbroadcasti128 m5, [r0 + 88] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 80], xm4 - - 
vbroadcasti128 m4, [r0 + 96] - vbroadcasti128 m5, [r0 + 104] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 96], xm4 - - vbroadcasti128 m4, [r0 + 112] - vbroadcasti128 m5, [r0 + 120] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 112], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - -IPFILTER_CHROMA_PS_64xN_AVX2 64 -IPFILTER_CHROMA_PS_64xN_AVX2 48 -IPFILTER_CHROMA_PS_64xN_AVX2 32 -IPFILTER_CHROMA_PS_64xN_AVX2 16 - -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_horiz_ps_48x64, 4, 7, 6 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m3, [h4_interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 2 - test r5d, r5d - mov r4d, 64 - jz .loop0 - sub r0, r1 - add r4d, 3 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 16], xm4 - - vbroadcasti128 m4, [r0 + 32] - vbroadcasti128 m5, [r0 + 40] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - 
psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 32], xm4 - - vbroadcasti128 m4, [r0 + 48] - vbroadcasti128 m5, [r0 + 56] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 48], xm4 - - vbroadcasti128 m4, [r0 + 64] - vbroadcasti128 m5, [r0 + 72] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 64], xm4 - - vbroadcasti128 m4, [r0 + 80] - vbroadcasti128 m5, [r0 + 88] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movu [r2 + 80], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif - -%macro IPFILTER_CHROMA_PS_6xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_horiz_ps_6x%1, 4, 7, 6 - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - vpbroadcastq m0, [r6 + r4 * 8] -%else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] -%endif - mova m3, [h4_interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 2 - test r5d, r5d - mov r4d, %1 - jz .loop0 - sub r0, r1 - add r4d, 3 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - pmaddwd m4, m0 - pmaddwd m5, m0 - phaddd m4, m5 - paddd m4, m2 - vpermq m4, m4, q3120 - psrad m4, INTERP_SHIFT_PS - vextracti128 xm5, m4, 1 - packssdw xm4, xm5 - movq [r2], xm4 - pextrd [r2 + 8], xm4, 2 - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - - IPFILTER_CHROMA_PS_6xN_AVX2 8 - IPFILTER_CHROMA_PS_6xN_AVX2 16
x265_2.7.tar.gz/source/common/x86/v4-ipfilter16.asm
Deleted
@@ -1,3529 +0,0 @@
-;*****************************************************************************
-;* Copyright (C) 2013-2017 MulticoreWare, Inc
-;*
-;* Authors: Nabajit Deka <nabajit@multicorewareinc.com>
-;*          Murugan Vairavel <murugan@multicorewareinc.com>
-;*          Min Chen <chenm003@163.com>
-;*
-;* This program is free software; you can redistribute it and/or modify
-;* it under the terms of the GNU General Public License as published by
-;* the Free Software Foundation; either version 2 of the License, or
-;* (at your option) any later version.
-;*
-;* This program is distributed in the hope that it will be useful,
-;* but WITHOUT ANY WARRANTY; without even the implied warranty of
-;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-;* GNU General Public License for more details.
-;*
-;* You should have received a copy of the GNU General Public License
-;* along with this program; if not, write to the Free Software
-;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
-;*
-;* This program is also available under a commercial proprietary license.
-;* For more information, contact us at license @ x265.com.
-;*****************************************************************************/
-
-%include "x86inc.asm"
-%include "x86util.asm"
-
-
-%define INTERP_OFFSET_PP        pd_32
-%define INTERP_SHIFT_PP         6
-
-%if BIT_DEPTH == 10
-    %define INTERP_SHIFT_PS         2
-    %define INTERP_OFFSET_PS        pd_n32768
-    %define INTERP_SHIFT_SP         10
-    %define INTERP_OFFSET_SP        v4_pd_524800
-%elif BIT_DEPTH == 12
-    %define INTERP_SHIFT_PS         4
-    %define INTERP_OFFSET_PS        pd_n131072
-    %define INTERP_SHIFT_SP         8
-    %define INTERP_OFFSET_SP        pd_524416
-%else
-    %error Unsupport bit depth!
-%endif
-
-
-SECTION_RODATA 32
-
-v4_pd_524800:      times 8 dd 524800
-tab_c_n8192:       times 8 dw -8192
-
-const tab_ChromaCoeffV, times 8 dw 0, 64
-                        times 8 dw 0, 0
-
-                        times 8 dw -2, 58
-                        times 8 dw 10, -2
-
-                        times 8 dw -4, 54
-                        times 8 dw 16, -2
-
-                        times 8 dw -6, 46
-                        times 8 dw 28, -4
-
-                        times 8 dw -4, 36
-                        times 8 dw 36, -4
-
-                        times 8 dw -4, 28
-                        times 8 dw 46, -6
-
-                        times 8 dw -2, 16
-                        times 8 dw 54, -4
-
-                        times 8 dw -2, 10
-                        times 8 dw 58, -2
-
-tab_ChromaCoeffVer: times 8 dw 0, 64
-                    times 8 dw 0, 0
-
-                    times 8 dw -2, 58
-                    times 8 dw 10, -2
-
-                    times 8 dw -4, 54
-                    times 8 dw 16, -2
-
-                    times 8 dw -6, 46
-                    times 8 dw 28, -4
-
-                    times 8 dw -4, 36
-                    times 8 dw 36, -4
-
-                    times 8 dw -4, 28
-                    times 8 dw 46, -6
-
-                    times 8 dw -2, 16
-                    times 8 dw 54, -4
-
-                    times 8 dw -2, 10
-                    times 8 dw 58, -2
-
-SECTION .text
-cextern pd_8
-cextern pd_32
-cextern pw_pixel_max
-cextern pd_524416
-cextern pd_n32768
-cextern pd_n131072
-cextern pw_2000
-cextern idct8_shuf2
-
-%macro PROCESS_CHROMA_SP_W4_4R 0
-    movq       m0, [r0]
-    movq       m1, [r0 + r1]
-    punpcklwd  m0, m1                          ;m0=[0 1]
-    pmaddwd    m0, [r6 + 0 *32]                ;m0=[0+1]         Row1
-
-    lea        r0, [r0 + 2 * r1]
-    movq       m4, [r0]
-    punpcklwd  m1, m4                          ;m1=[1 2]
-    pmaddwd    m1, [r6 + 0 *32]                ;m1=[1+2]         Row2
-
-    movq       m5, [r0 + r1]
-    punpcklwd  m4, m5                          ;m4=[2 3]
-    pmaddwd    m2, m4, [r6 + 0 *32]            ;m2=[2+3]         Row3
-    pmaddwd    m4, [r6 + 1 * 32]
-    paddd      m0, m4                          ;m0=[0+1+2+3]     Row1 done
-
-    lea        r0, [r0 + 2 * r1]
-    movq       m4, [r0]
-    punpcklwd  m5, m4                          ;m5=[3 4]
-    pmaddwd    m3, m5, [r6 + 0 *32]            ;m3=[3+4]         Row4
-    pmaddwd    m5, [r6 + 1 * 32]
-    paddd      m1, m5                          ;m1 = [1+2+3+4]   Row2
-
-    movq       m5, [r0 + r1]
-    punpcklwd  m4, m5                          ;m4=[4 5]
-    pmaddwd    m4, [r6 + 1 * 32]
-    paddd      m2, m4                          ;m2=[2+3+4+5]     Row3
-
-    movq       m4, [r0 + 2 * r1]
-    punpcklwd  m5, m4                          ;m5=[5 6]
-    pmaddwd    m5, [r6 + 1 * 32]
-    paddd      m3, m5                          ;m3=[3+4+5+6]     Row4
-%endmacro
-
-;-----------------------------------------------------------------------------------------------------------------
-; void interp_4tap_vert_%3_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
-;-----------------------------------------------------------------------------------------------------------------
-%macro FILTER_VER_CHROMA_SS 4
-INIT_XMM sse2
-cglobal interp_4tap_vert_%3_%1x%2, 5, 7, %4 ,0-gprsize
-
-    add       r1d, r1d
-    add       r3d, r3d
-    sub       r0, r1
-    shl       r4d, 6
-
-%ifdef PIC
-    lea       r5, [tab_ChromaCoeffV]
-    lea       r6, [r5 + r4]
-%else
-    lea       r6, [tab_ChromaCoeffV + r4]
-%endif
-
-    mov       dword [rsp], %2/4
-
-%ifnidn %3, ss
-    %ifnidn %3, ps
-        mova      m7, [pw_pixel_max]
-        %ifidn %3, pp
-            mova      m6, [INTERP_OFFSET_PP]
-        %else
-            mova      m6, [INTERP_OFFSET_SP]
-        %endif
-    %else
-        mova      m6, [INTERP_OFFSET_PS]
-    %endif
-%endif
-
-.loopH:
-    mov       r4d, (%1/4)
-.loopW:
-    PROCESS_CHROMA_SP_W4_4R
-
-%ifidn %3, ss
-    psrad     m0, 6
-    psrad     m1, 6
-    psrad     m2, 6
-    psrad     m3, 6
-
-    packssdw  m0, m1
-    packssdw  m2, m3
-%elifidn %3, ps
-    paddd     m0, m6
-    paddd     m1, m6
-    paddd     m2, m6
-    paddd     m3, m6
-    psrad     m0, INTERP_SHIFT_PS
-    psrad     m1, INTERP_SHIFT_PS
-    psrad     m2, INTERP_SHIFT_PS
-    psrad     m3, INTERP_SHIFT_PS
-
-    packssdw  m0, m1
-    packssdw  m2, m3
-%else
-    paddd     m0, m6
-    paddd     m1, m6
-    paddd     m2, m6
-    paddd     m3, m6
-    %ifidn %3, pp
-        psrad     m0, INTERP_SHIFT_PP
-        psrad     m1, INTERP_SHIFT_PP
-        psrad     m2, INTERP_SHIFT_PP
-        psrad     m3, INTERP_SHIFT_PP
-    %else
-        psrad     m0, INTERP_SHIFT_SP
-        psrad     m1, INTERP_SHIFT_SP
-        psrad     m2, INTERP_SHIFT_SP
-        psrad     m3, INTERP_SHIFT_SP
-    %endif
-    packssdw  m0, m1
-    packssdw  m2, m3
-    pxor      m5, m5
-    CLIPW2    m0, m2, m5, m7
-%endif
-
-    movh      [r2], m0
-    movhps    [r2 + r3], m0
-    lea       r5, [r2 + 2 * r3]
-    movh      [r5], m2
-    movhps    [r5 + r3], m2
-
-    lea       r5, [4 * r1 - 2 * 4]
-    sub       r0, r5
-    add       r2, 2 * 4
-
-    dec       r4d
-    jnz       .loopW
-
-    lea       r0, [r0 + 4 * r1 - 2 * %1]
-    lea       r2, [r2 + 4 * r3 - 2 * %1]
-
-    dec       dword [rsp]
-    jnz       .loopH
-
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_SS 4, 4, ss, 6
-    FILTER_VER_CHROMA_SS 4, 8, ss, 6
-    FILTER_VER_CHROMA_SS 16, 16, ss, 6
-    FILTER_VER_CHROMA_SS 16, 8, ss, 6
-    FILTER_VER_CHROMA_SS 16, 12, ss, 6
-    FILTER_VER_CHROMA_SS 12, 16, ss, 6
-    FILTER_VER_CHROMA_SS 16, 4, ss, 6
-    FILTER_VER_CHROMA_SS 4, 16, ss, 6
-    FILTER_VER_CHROMA_SS 32, 32, ss, 6
-    FILTER_VER_CHROMA_SS 32, 16, ss, 6
-    FILTER_VER_CHROMA_SS 16, 32, ss, 6
-    FILTER_VER_CHROMA_SS 32, 24, ss, 6
-    FILTER_VER_CHROMA_SS 24, 32, ss, 6
-    FILTER_VER_CHROMA_SS 32, 8, ss, 6
-
-    FILTER_VER_CHROMA_SS 4, 4, ps, 7
-    FILTER_VER_CHROMA_SS 4, 8, ps, 7
-    FILTER_VER_CHROMA_SS 16, 16, ps, 7
-    FILTER_VER_CHROMA_SS 16, 8, ps, 7
-    FILTER_VER_CHROMA_SS 16, 12, ps, 7
-    FILTER_VER_CHROMA_SS 12, 16, ps, 7
-    FILTER_VER_CHROMA_SS 16, 4, ps, 7
-    FILTER_VER_CHROMA_SS 4, 16, ps, 7
-    FILTER_VER_CHROMA_SS 32, 32, ps, 7
-    FILTER_VER_CHROMA_SS 32, 16, ps, 7
-    FILTER_VER_CHROMA_SS 16, 32, ps, 7
-    FILTER_VER_CHROMA_SS 32, 24, ps, 7
-    FILTER_VER_CHROMA_SS 24, 32, ps, 7
-    FILTER_VER_CHROMA_SS 32, 8, ps, 7
-
-    FILTER_VER_CHROMA_SS 4, 4, sp, 8
-    FILTER_VER_CHROMA_SS 4, 8, sp, 8
-    FILTER_VER_CHROMA_SS 16, 16, sp, 8
-    FILTER_VER_CHROMA_SS 16, 8, sp, 8
-    FILTER_VER_CHROMA_SS 16, 12, sp, 8
-    FILTER_VER_CHROMA_SS 12, 16, sp, 8
-    FILTER_VER_CHROMA_SS 16, 4, sp, 8
-    FILTER_VER_CHROMA_SS 4, 16, sp, 8
-    FILTER_VER_CHROMA_SS 32, 32, sp, 8
-    FILTER_VER_CHROMA_SS 32, 16, sp, 8
-    FILTER_VER_CHROMA_SS 16, 32, sp, 8
-    FILTER_VER_CHROMA_SS 32, 24, sp, 8
-    FILTER_VER_CHROMA_SS 24, 32, sp, 8
-    FILTER_VER_CHROMA_SS 32, 8, sp, 8
-
-    FILTER_VER_CHROMA_SS 4, 4, pp, 8
-    FILTER_VER_CHROMA_SS 4, 8, pp, 8
-    FILTER_VER_CHROMA_SS 16, 16, pp, 8
-    FILTER_VER_CHROMA_SS 16, 8, pp, 8
-    FILTER_VER_CHROMA_SS 16, 12, pp, 8
-    FILTER_VER_CHROMA_SS 12, 16, pp, 8
-    FILTER_VER_CHROMA_SS 16, 4, pp, 8
-    FILTER_VER_CHROMA_SS 4, 16, pp, 8
-    FILTER_VER_CHROMA_SS 32, 32, pp, 8
-    FILTER_VER_CHROMA_SS 32, 16, pp, 8
-    FILTER_VER_CHROMA_SS 16, 32, pp, 8
-    FILTER_VER_CHROMA_SS 32, 24, pp, 8
-    FILTER_VER_CHROMA_SS 24, 32, pp, 8
-    FILTER_VER_CHROMA_SS 32, 8, pp, 8
-
-
-    FILTER_VER_CHROMA_SS 16, 24, ss, 6
-    FILTER_VER_CHROMA_SS 12, 32, ss, 6
-    FILTER_VER_CHROMA_SS 4, 32, ss, 6
-
FILTER_VER_CHROMA_SS 32, 64, ss, 6 - FILTER_VER_CHROMA_SS 16, 64, ss, 6 - FILTER_VER_CHROMA_SS 32, 48, ss, 6 - FILTER_VER_CHROMA_SS 24, 64, ss, 6 - - FILTER_VER_CHROMA_SS 16, 24, ps, 7 - FILTER_VER_CHROMA_SS 12, 32, ps, 7 - FILTER_VER_CHROMA_SS 4, 32, ps, 7 - FILTER_VER_CHROMA_SS 32, 64, ps, 7 - FILTER_VER_CHROMA_SS 16, 64, ps, 7 - FILTER_VER_CHROMA_SS 32, 48, ps, 7 - FILTER_VER_CHROMA_SS 24, 64, ps, 7 - - FILTER_VER_CHROMA_SS 16, 24, sp, 8 - FILTER_VER_CHROMA_SS 12, 32, sp, 8 - FILTER_VER_CHROMA_SS 4, 32, sp, 8 - FILTER_VER_CHROMA_SS 32, 64, sp, 8 - FILTER_VER_CHROMA_SS 16, 64, sp, 8 - FILTER_VER_CHROMA_SS 32, 48, sp, 8 - FILTER_VER_CHROMA_SS 24, 64, sp, 8 - - FILTER_VER_CHROMA_SS 16, 24, pp, 8 - FILTER_VER_CHROMA_SS 12, 32, pp, 8 - FILTER_VER_CHROMA_SS 4, 32, pp, 8 - FILTER_VER_CHROMA_SS 32, 64, pp, 8 - FILTER_VER_CHROMA_SS 16, 64, pp, 8 - FILTER_VER_CHROMA_SS 32, 48, pp, 8 - FILTER_VER_CHROMA_SS 24, 64, pp, 8 - - - FILTER_VER_CHROMA_SS 48, 64, ss, 6 - FILTER_VER_CHROMA_SS 64, 48, ss, 6 - FILTER_VER_CHROMA_SS 64, 64, ss, 6 - FILTER_VER_CHROMA_SS 64, 32, ss, 6 - FILTER_VER_CHROMA_SS 64, 16, ss, 6 - - FILTER_VER_CHROMA_SS 48, 64, ps, 7 - FILTER_VER_CHROMA_SS 64, 48, ps, 7 - FILTER_VER_CHROMA_SS 64, 64, ps, 7 - FILTER_VER_CHROMA_SS 64, 32, ps, 7 - FILTER_VER_CHROMA_SS 64, 16, ps, 7 - - FILTER_VER_CHROMA_SS 48, 64, sp, 8 - FILTER_VER_CHROMA_SS 64, 48, sp, 8 - FILTER_VER_CHROMA_SS 64, 64, sp, 8 - FILTER_VER_CHROMA_SS 64, 32, sp, 8 - FILTER_VER_CHROMA_SS 64, 16, sp, 8 - - FILTER_VER_CHROMA_SS 48, 64, pp, 8 - FILTER_VER_CHROMA_SS 64, 48, pp, 8 - FILTER_VER_CHROMA_SS 64, 64, pp, 8 - FILTER_VER_CHROMA_SS 64, 32, pp, 8 - FILTER_VER_CHROMA_SS 64, 16, pp, 8 - - -%macro PROCESS_CHROMA_SP_W2_4R 1 - movd m0, [r0] - movd m1, [r0 + r1] - punpcklwd m0, m1 ;m0=[0 1] - - lea r0, [r0 + 2 * r1] - movd m2, [r0] - punpcklwd m1, m2 ;m1=[1 2] - punpcklqdq m0, m1 ;m0=[0 1 1 2] - pmaddwd m0, [%1 + 0 *32] ;m0=[0+1 1+2] Row 1-2 - - movd m1, [r0 + r1] - punpcklwd m2, m1 ;m2=[2 3] - - lea r0, 
[r0 + 2 * r1] - movd m3, [r0] - punpcklwd m1, m3 ;m2=[3 4] - punpcklqdq m2, m1 ;m2=[2 3 3 4] - - pmaddwd m4, m2, [%1 + 1 * 32] ;m4=[2+3 3+4] Row 1-2 - pmaddwd m2, [%1 + 0 * 32] ;m2=[2+3 3+4] Row 3-4 - paddd m0, m4 ;m0=[0+1+2+3 1+2+3+4] Row 1-2 - - movd m1, [r0 + r1] - punpcklwd m3, m1 ;m3=[4 5] - - movd m4, [r0 + 2 * r1] - punpcklwd m1, m4 ;m1=[5 6] - punpcklqdq m3, m1 ;m2=[4 5 5 6] - pmaddwd m3, [%1 + 1 * 32] ;m3=[4+5 5+6] Row 3-4 - paddd m2, m3 ;m2=[2+3+4+5 3+4+5+6] Row 3-4 -%endmacro -;--------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vertical_%2_2x%1(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W2 3 -INIT_XMM sse4 -cglobal interp_4tap_vert_%2_2x%1, 5, 6, %3 - - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - mov r4d, (%1/4) -%ifnidn %2, ss - %ifnidn %2, ps - pxor m7, m7 - mova m6, [pw_pixel_max] - %ifidn %2, pp - mova m5, [INTERP_OFFSET_PP] - %else - mova m5, [INTERP_OFFSET_SP] - %endif - %else - mova m5, [INTERP_OFFSET_PS] - %endif -%endif - -.loopH: - PROCESS_CHROMA_SP_W2_4R r5 -%ifidn %2, ss - psrad m0, 6 - psrad m2, 6 - packssdw m0, m2 -%elifidn %2, ps - paddd m0, m5 - paddd m2, m5 - psrad m0, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - packssdw m0, m2 -%else - paddd m0, m5 - paddd m2, m5 - %ifidn %2, pp - psrad m0, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - %else - psrad m0, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - %endif - packusdw m0, m2 - CLIPW m0, m7, m6 -%endif - - movd [r2], m0 - pextrd [r2 + r3], m0, 1 - lea r2, [r2 + 2 * r3] - pextrd [r2], m0, 2 - pextrd [r2 + r3], m0, 3 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loopH - RET -%endmacro - 
-FILTER_VER_CHROMA_W2 4, ss, 5 -FILTER_VER_CHROMA_W2 8, ss, 5 - -FILTER_VER_CHROMA_W2 4, pp, 8 -FILTER_VER_CHROMA_W2 8, pp, 8 - -FILTER_VER_CHROMA_W2 4, ps, 6 -FILTER_VER_CHROMA_W2 8, ps, 6 - -FILTER_VER_CHROMA_W2 4, sp, 8 -FILTER_VER_CHROMA_W2 8, sp, 8 - -FILTER_VER_CHROMA_W2 16, ss, 5 -FILTER_VER_CHROMA_W2 16, pp, 8 -FILTER_VER_CHROMA_W2 16, ps, 6 -FILTER_VER_CHROMA_W2 16, sp, 8 - - -;--------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_%1_4x2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W4 3 -INIT_XMM sse4 -cglobal interp_4tap_vert_%2_4x%1, 5, 6, %3 - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - -%ifnidn %2, 2 - mov r4d, %1/2 -%endif - -%ifnidn %2, ss - %ifnidn %2, ps - pxor m6, m6 - mova m5, [pw_pixel_max] - %ifidn %2, pp - mova m4, [INTERP_OFFSET_PP] - %else - mova m4, [INTERP_OFFSET_SP] - %endif - %else - mova m4, [INTERP_OFFSET_PS] - %endif -%endif - -%ifnidn %2, 2 -.loop: -%endif - - movh m0, [r0] - movh m1, [r0 + r1] - punpcklwd m0, m1 ;m0=[0 1] - pmaddwd m0, [r5 + 0 *32] ;m0=[0+1] Row1 - - lea r0, [r0 + 2 * r1] - movh m2, [r0] - punpcklwd m1, m2 ;m1=[1 2] - pmaddwd m1, [r5 + 0 *32] ;m1=[1+2] Row2 - - movh m3, [r0 + r1] - punpcklwd m2, m3 ;m4=[2 3] - pmaddwd m2, [r5 + 1 * 32] - paddd m0, m2 ;m0=[0+1+2+3] Row1 done - - movh m2, [r0 + 2 * r1] - punpcklwd m3, m2 ;m5=[3 4] - pmaddwd m3, [r5 + 1 * 32] - paddd m1, m3 ;m1=[1+2+3+4] Row2 done - -%ifidn %2, ss - psrad m0, 6 - psrad m1, 6 - packssdw m0, m1 -%elifidn %2, ps - paddd m0, m4 - paddd m1, m4 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - packssdw m0, m1 -%else - paddd m0, m4 - paddd m1, m4 - %ifidn %2, pp - psrad m0, 
INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - %else - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - %endif - packusdw m0, m1 - CLIPW m0, m6, m5 -%endif - - movh [r2], m0 - movhps [r2 + r3], m0 - -%ifnidn %2, 2 - lea r2, [r2 + r3 * 2] - dec r4d - jnz .loop -%endif - RET -%endmacro - -FILTER_VER_CHROMA_W4 2, ss, 4 -FILTER_VER_CHROMA_W4 2, pp, 7 -FILTER_VER_CHROMA_W4 2, ps, 5 -FILTER_VER_CHROMA_W4 2, sp, 7 - -FILTER_VER_CHROMA_W4 4, ss, 4 -FILTER_VER_CHROMA_W4 4, pp, 7 -FILTER_VER_CHROMA_W4 4, ps, 5 -FILTER_VER_CHROMA_W4 4, sp, 7 - -;------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vertical_%1_6x8(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W6 3 -INIT_XMM sse4 -cglobal interp_4tap_vert_%2_6x%1, 5, 7, %3 - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r6, [r5 + r4] -%else - lea r6, [tab_ChromaCoeffV + r4] -%endif - - mov r4d, %1/4 - -%ifnidn %2, ss - %ifnidn %2, ps - mova m7, [pw_pixel_max] - %ifidn %2, pp - mova m6, [INTERP_OFFSET_PP] - %else - mova m6, [INTERP_OFFSET_SP] - %endif - %else - mova m6, [INTERP_OFFSET_PS] - %endif -%endif - -.loopH: - PROCESS_CHROMA_SP_W4_4R - -%ifidn %2, ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 -%elifidn %2, ps - paddd m0, m6 - paddd m1, m6 - paddd m2, m6 - paddd m3, m6 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - - packssdw m0, m1 - packssdw m2, m3 -%else - paddd m0, m6 - paddd m1, m6 - paddd m2, m6 - paddd m3, m6 - %ifidn %2, pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP - %else - psrad m0, INTERP_SHIFT_SP - psrad 
m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP - %endif - packssdw m0, m1 - packssdw m2, m3 - pxor m5, m5 - CLIPW2 m0, m2, m5, m7 -%endif - - movh [r2], m0 - movhps [r2 + r3], m0 - lea r5, [r2 + 2 * r3] - movh [r5], m2 - movhps [r5 + r3], m2 - - lea r5, [4 * r1 - 2 * 4] - sub r0, r5 - add r2, 2 * 4 - - PROCESS_CHROMA_SP_W2_4R r6 - -%ifidn %2, ss - psrad m0, 6 - psrad m2, 6 - packssdw m0, m2 -%elifidn %2, ps - paddd m0, m6 - paddd m2, m6 - psrad m0, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - packssdw m0, m2 -%else - paddd m0, m6 - paddd m2, m6 - %ifidn %2, pp - psrad m0, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - %else - psrad m0, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - %endif - packusdw m0, m2 - CLIPW m0, m5, m7 -%endif - - movd [r2], m0 - pextrd [r2 + r3], m0, 1 - lea r2, [r2 + 2 * r3] - pextrd [r2], m0, 2 - pextrd [r2 + r3], m0, 3 - - sub r0, 2 * 4 - lea r2, [r2 + 2 * r3 - 2 * 4] - - dec r4d - jnz .loopH - RET -%endmacro - -FILTER_VER_CHROMA_W6 8, ss, 6 -FILTER_VER_CHROMA_W6 8, ps, 7 -FILTER_VER_CHROMA_W6 8, sp, 8 -FILTER_VER_CHROMA_W6 8, pp, 8 - -FILTER_VER_CHROMA_W6 16, ss, 6 -FILTER_VER_CHROMA_W6 16, ps, 7 -FILTER_VER_CHROMA_W6 16, sp, 8 -FILTER_VER_CHROMA_W6 16, pp, 8 - -%macro PROCESS_CHROMA_SP_W8_2R 0 - movu m1, [r0] - movu m3, [r0 + r1] - punpcklwd m0, m1, m3 - pmaddwd m0, [r5 + 0 * 32] ;m0 = [0l+1l] Row1l - punpckhwd m1, m3 - pmaddwd m1, [r5 + 0 * 32] ;m1 = [0h+1h] Row1h - - movu m4, [r0 + 2 * r1] - punpcklwd m2, m3, m4 - pmaddwd m2, [r5 + 0 * 32] ;m2 = [1l+2l] Row2l - punpckhwd m3, m4 - pmaddwd m3, [r5 + 0 * 32] ;m3 = [1h+2h] Row2h - - lea r0, [r0 + 2 * r1] - movu m5, [r0 + r1] - punpcklwd m6, m4, m5 - pmaddwd m6, [r5 + 1 * 32] ;m6 = [2l+3l] Row1l - paddd m0, m6 ;m0 = [0l+1l+2l+3l] Row1l sum - punpckhwd m4, m5 - pmaddwd m4, [r5 + 1 * 32] ;m6 = [2h+3h] Row1h - paddd m1, m4 ;m1 = [0h+1h+2h+3h] Row1h sum - - movu m4, [r0 + 2 * r1] - punpcklwd m6, m5, m4 - pmaddwd m6, [r5 + 1 * 32] ;m6 = [3l+4l] Row2l - paddd m2, m6 
;m2 = [1l+2l+3l+4l] Row2l sum - punpckhwd m5, m4 - pmaddwd m5, [r5 + 1 * 32] ;m1 = [3h+4h] Row2h - paddd m3, m5 ;m3 = [1h+2h+3h+4h] Row2h sum -%endmacro - -;---------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_%3_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;---------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W8 4 -INIT_XMM sse2 -cglobal interp_4tap_vert_%3_%1x%2, 5, 6, %4 - - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - mov r4d, %2/2 - -%ifidn %3, pp - mova m7, [INTERP_OFFSET_PP] -%elifidn %3, sp - mova m7, [INTERP_OFFSET_SP] -%elifidn %3, ps - mova m7, [INTERP_OFFSET_PS] -%endif - -.loopH: - PROCESS_CHROMA_SP_W8_2R - -%ifidn %3, ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 -%elifidn %3, ps - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - - packssdw m0, m1 - packssdw m2, m3 -%else - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - %ifidn %3, pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP - %else - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP - %endif - packssdw m0, m1 - packssdw m2, m3 - pxor m5, m5 - mova m6, [pw_pixel_max] - CLIPW2 m0, m2, m5, m6 -%endif - - movu [r2], m0 - movu [r2 + r3], m2 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loopH - RET -%endmacro - -FILTER_VER_CHROMA_W8 8, 2, ss, 7 -FILTER_VER_CHROMA_W8 8, 4, ss, 7 -FILTER_VER_CHROMA_W8 8, 6, ss, 7 -FILTER_VER_CHROMA_W8 8, 8, ss, 7 
-FILTER_VER_CHROMA_W8 8, 16, ss, 7 -FILTER_VER_CHROMA_W8 8, 32, ss, 7 - -FILTER_VER_CHROMA_W8 8, 2, sp, 8 -FILTER_VER_CHROMA_W8 8, 4, sp, 8 -FILTER_VER_CHROMA_W8 8, 6, sp, 8 -FILTER_VER_CHROMA_W8 8, 8, sp, 8 -FILTER_VER_CHROMA_W8 8, 16, sp, 8 -FILTER_VER_CHROMA_W8 8, 32, sp, 8 - -FILTER_VER_CHROMA_W8 8, 2, ps, 8 -FILTER_VER_CHROMA_W8 8, 4, ps, 8 -FILTER_VER_CHROMA_W8 8, 6, ps, 8 -FILTER_VER_CHROMA_W8 8, 8, ps, 8 -FILTER_VER_CHROMA_W8 8, 16, ps, 8 -FILTER_VER_CHROMA_W8 8, 32, ps, 8 - -FILTER_VER_CHROMA_W8 8, 2, pp, 8 -FILTER_VER_CHROMA_W8 8, 4, pp, 8 -FILTER_VER_CHROMA_W8 8, 6, pp, 8 -FILTER_VER_CHROMA_W8 8, 8, pp, 8 -FILTER_VER_CHROMA_W8 8, 16, pp, 8 -FILTER_VER_CHROMA_W8 8, 32, pp, 8 - -FILTER_VER_CHROMA_W8 8, 12, ss, 7 -FILTER_VER_CHROMA_W8 8, 64, ss, 7 -FILTER_VER_CHROMA_W8 8, 12, sp, 8 -FILTER_VER_CHROMA_W8 8, 64, sp, 8 -FILTER_VER_CHROMA_W8 8, 12, ps, 8 -FILTER_VER_CHROMA_W8 8, 64, ps, 8 -FILTER_VER_CHROMA_W8 8, 12, pp, 8 -FILTER_VER_CHROMA_W8 8, 64, pp, 8 - -%macro PROCESS_CHROMA_VERT_W16_2R 0 - movu m1, [r0] - movu m3, [r0 + r1] - punpcklwd m0, m1, m3 - pmaddwd m0, [r5 + 0 * 32] - punpckhwd m1, m3 - pmaddwd m1, [r5 + 0 * 32] - - movu m4, [r0 + 2 * r1] - punpcklwd m2, m3, m4 - pmaddwd m2, [r5 + 0 * 32] - punpckhwd m3, m4 - pmaddwd m3, [r5 + 0 * 32] - - lea r0, [r0 + 2 * r1] - movu m5, [r0 + r1] - punpcklwd m6, m4, m5 - pmaddwd m6, [r5 + 1 * 32] - paddd m0, m6 - punpckhwd m4, m5 - pmaddwd m4, [r5 + 1 * 32] - paddd m1, m4 - - movu m4, [r0 + 2 * r1] - punpcklwd m6, m5, m4 - pmaddwd m6, [r5 + 1 * 32] - paddd m2, m6 - punpckhwd m5, m4 - pmaddwd m5, [r5 + 1 * 32] - paddd m3, m5 -%endmacro - -;----------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_AVX2_6xN 2 
-INIT_YMM avx2 -%if ARCH_X86_64 -cglobal interp_4tap_vert_%2_6x%1, 4, 7, 10 - mov r4d, r4m - add r1d, r1d - add r3d, r3d - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - sub r0, r1 - mov r6d, %1/4 - -%ifidn %2,pp - vbroadcasti128 m8, [INTERP_OFFSET_PP] -%elifidn %2, sp - vbroadcasti128 m8, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m8, [INTERP_OFFSET_PS] -%endif - -.loopH: - movu xm0, [r0] - movu xm1, [r0 + r1] - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - - movu xm2, [r0 + r1 * 2] - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - - lea r4, [r1 * 3] - movu xm3, [r0 + r4] - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m4 - - lea r0, [r0 + r1 * 4] - movu xm4, [r0] - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - pmaddwd m3, [r5] - paddd m1, m5 - - movu xm5, [r0 + r1] - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - pmaddwd m4, [r5] - paddd m2, m6 - - movu xm6, [r0 + r1 * 2] - punpckhwd xm7, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddwd m7, m5, [r5 + 1 * mmsize] - pmaddwd m5, [r5] - paddd m3, m7 - lea r4, [r3 * 3] -%ifidn %2,ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 -%else - paddd m0, m8 - paddd m1, m8 - paddd m2, m8 - paddd m3, m8 -%ifidn %2,pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP -%elifidn %2, sp - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP -%else - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS -%endif -%endif - - packssdw m0, m1 - 
packssdw m2, m3 - vpermq m0, m0, q3120 - vpermq m2, m2, q3120 - pxor m5, m5 - mova m9, [pw_pixel_max] -%ifidn %2,pp - CLIPW m0, m5, m9 - CLIPW m2, m5, m9 -%elifidn %2, sp - CLIPW m0, m5, m9 - CLIPW m2, m5, m9 -%endif - - vextracti128 xm1, m0, 1 - vextracti128 xm3, m2, 1 - movq [r2], xm0 - pextrd [r2 + 8], xm0, 2 - movq [r2 + r3], xm1 - pextrd [r2 + r3 + 8], xm1, 2 - movq [r2 + r3 * 2], xm2 - pextrd [r2 + r3 * 2 + 8], xm2, 2 - movq [r2 + r4], xm3 - pextrd [r2 + r4 + 8], xm3, 2 - - lea r2, [r2 + r3 * 4] - dec r6d - jnz .loopH - RET -%endif -%endmacro -FILTER_VER_CHROMA_AVX2_6xN 8, pp -FILTER_VER_CHROMA_AVX2_6xN 8, ps -FILTER_VER_CHROMA_AVX2_6xN 8, ss -FILTER_VER_CHROMA_AVX2_6xN 8, sp -FILTER_VER_CHROMA_AVX2_6xN 16, pp -FILTER_VER_CHROMA_AVX2_6xN 16, ps -FILTER_VER_CHROMA_AVX2_6xN 16, ss -FILTER_VER_CHROMA_AVX2_6xN 16, sp - -;----------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W16_16xN_avx2 3 -INIT_YMM avx2 -cglobal interp_4tap_vert_%2_16x%1, 5, 6, %3 - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - mov r4d, %1/2 - -%ifidn %2, pp - vbroadcasti128 m7, [INTERP_OFFSET_PP] -%elifidn %2, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%elifidn %2, ps - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - -.loopH: - PROCESS_CHROMA_VERT_W16_2R -%ifidn %2, ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 -%elifidn %2, ps - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - - packssdw m0, m1 - 
packssdw m2, m3 -%else - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - %ifidn %2, pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP -%else - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP -%endif - packssdw m0, m1 - packssdw m2, m3 - pxor m5, m5 - CLIPW2 m0, m2, m5, [pw_pixel_max] -%endif - - movu [r2], m0 - movu [r2 + r3], m2 - lea r2, [r2 + 2 * r3] - dec r4d - jnz .loopH - RET -%endmacro - FILTER_VER_CHROMA_W16_16xN_avx2 4, pp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 8, pp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 12, pp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 24, pp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 16, pp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 32, pp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 64, pp, 8 - - FILTER_VER_CHROMA_W16_16xN_avx2 4, ps, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 8, ps, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 12, ps, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 24, ps, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 16, ps, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 32, ps, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 64, ps, 8 - - FILTER_VER_CHROMA_W16_16xN_avx2 4, ss, 7 - FILTER_VER_CHROMA_W16_16xN_avx2 8, ss, 7 - FILTER_VER_CHROMA_W16_16xN_avx2 12, ss, 7 - FILTER_VER_CHROMA_W16_16xN_avx2 24, ss, 7 - FILTER_VER_CHROMA_W16_16xN_avx2 16, ss, 7 - FILTER_VER_CHROMA_W16_16xN_avx2 32, ss, 7 - FILTER_VER_CHROMA_W16_16xN_avx2 64, ss, 7 - - FILTER_VER_CHROMA_W16_16xN_avx2 4, sp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 8, sp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 12, sp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 24, sp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 16, sp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 32, sp, 8 - FILTER_VER_CHROMA_W16_16xN_avx2 64, sp, 8 - -%macro PROCESS_CHROMA_VERT_W32_2R 0 - movu m1, [r0] - movu m3, [r0 + r1] - punpcklwd m0, m1, m3 - pmaddwd m0, [r5 + 0 * mmsize] - punpckhwd m1, m3 - pmaddwd m1, [r5 + 0 * mmsize] - - movu m9, [r0 + mmsize] - movu m11, [r0 + r1 + 
mmsize] - punpcklwd m8, m9, m11 - pmaddwd m8, [r5 + 0 * mmsize] - punpckhwd m9, m11 - pmaddwd m9, [r5 + 0 * mmsize] - - movu m4, [r0 + 2 * r1] - punpcklwd m2, m3, m4 - pmaddwd m2, [r5 + 0 * mmsize] - punpckhwd m3, m4 - pmaddwd m3, [r5 + 0 * mmsize] - - movu m12, [r0 + 2 * r1 + mmsize] - punpcklwd m10, m11, m12 - pmaddwd m10, [r5 + 0 * mmsize] - punpckhwd m11, m12 - pmaddwd m11, [r5 + 0 * mmsize] - - lea r6, [r0 + 2 * r1] - movu m5, [r6 + r1] - punpcklwd m6, m4, m5 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m0, m6 - punpckhwd m4, m5 - pmaddwd m4, [r5 + 1 * mmsize] - paddd m1, m4 - - movu m13, [r6 + r1 + mmsize] - punpcklwd m14, m12, m13 - pmaddwd m14, [r5 + 1 * mmsize] - paddd m8, m14 - punpckhwd m12, m13 - pmaddwd m12, [r5 + 1 * mmsize] - paddd m9, m12 - - movu m4, [r6 + 2 * r1] - punpcklwd m6, m5, m4 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m2, m6 - punpckhwd m5, m4 - pmaddwd m5, [r5 + 1 * mmsize] - paddd m3, m5 - - movu m12, [r6 + 2 * r1 + mmsize] - punpcklwd m14, m13, m12 - pmaddwd m14, [r5 + 1 * mmsize] - paddd m10, m14 - punpckhwd m13, m12 - pmaddwd m13, [r5 + 1 * mmsize] - paddd m11, m13 -%endmacro - -;----------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W16_32xN_avx2 3 -INIT_YMM avx2 -%if ARCH_X86_64 -cglobal interp_4tap_vert_%2_32x%1, 5, 7, %3 - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - mov r4d, %1/2 - -%ifidn %2, pp - vbroadcasti128 m7, [INTERP_OFFSET_PP] -%elifidn %2, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%elifidn %2, ps - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - -.loopH: - PROCESS_CHROMA_VERT_W32_2R -%ifidn %2, ss - 
psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - psrad m8, 6 - psrad m9, 6 - psrad m10, 6 - psrad m11, 6 - - packssdw m0, m1 - packssdw m2, m3 - packssdw m8, m9 - packssdw m10, m11 -%elifidn %2, ps - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - paddd m8, m7 - paddd m9, m7 - paddd m10, m7 - paddd m11, m7 - psrad m8, INTERP_SHIFT_PS - psrad m9, INTERP_SHIFT_PS - psrad m10, INTERP_SHIFT_PS - psrad m11, INTERP_SHIFT_PS - - packssdw m0, m1 - packssdw m2, m3 - packssdw m8, m9 - packssdw m10, m11 -%else - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - paddd m8, m7 - paddd m9, m7 - paddd m10, m7 - paddd m11, m7 - %ifidn %2, pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP - psrad m8, INTERP_SHIFT_PP - psrad m9, INTERP_SHIFT_PP - psrad m10, INTERP_SHIFT_PP - psrad m11, INTERP_SHIFT_PP -%else - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP - psrad m8, INTERP_SHIFT_SP - psrad m9, INTERP_SHIFT_SP - psrad m10, INTERP_SHIFT_SP - psrad m11, INTERP_SHIFT_SP -%endif - packssdw m0, m1 - packssdw m2, m3 - packssdw m8, m9 - packssdw m10, m11 - pxor m5, m5 - CLIPW2 m0, m2, m5, [pw_pixel_max] - CLIPW2 m8, m10, m5, [pw_pixel_max] -%endif - - movu [r2], m0 - movu [r2 + r3], m2 - movu [r2 + mmsize], m8 - movu [r2 + r3 + mmsize], m10 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - dec r4d - jnz .loopH - RET -%endif -%endmacro - FILTER_VER_CHROMA_W16_32xN_avx2 8, pp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 16, pp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 24, pp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 32, pp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 48, pp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 64, pp, 15 - - FILTER_VER_CHROMA_W16_32xN_avx2 8, ps, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 16, ps, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 
24, ps, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 32, ps, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 48, ps, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 64, ps, 15 - - FILTER_VER_CHROMA_W16_32xN_avx2 8, ss, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 16, ss, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 24, ss, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 32, ss, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 48, ss, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 64, ss, 15 - - FILTER_VER_CHROMA_W16_32xN_avx2 8, sp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 16, sp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 24, sp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 32, sp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 48, sp, 15 - FILTER_VER_CHROMA_W16_32xN_avx2 64, sp, 15 - -;----------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W16_64xN_avx2 3 -INIT_YMM avx2 -cglobal interp_4tap_vert_%2_64x%1, 5, 7, %3 - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - mov r4d, %1/2 - -%ifidn %2, pp - vbroadcasti128 m7, [INTERP_OFFSET_PP] -%elifidn %2, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%elifidn %2, ps - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - -.loopH: -%assign x 0 -%rep 4 - movu m1, [r0 + x] - movu m3, [r0 + r1 + x] - movu m5, [r5 + 0 * mmsize] - punpcklwd m0, m1, m3 - pmaddwd m0, m5 - punpckhwd m1, m3 - pmaddwd m1, m5 - - movu m4, [r0 + 2 * r1 + x] - punpcklwd m2, m3, m4 - pmaddwd m2, m5 - punpckhwd m3, m4 - pmaddwd m3, m5 - - lea r6, [r0 + 2 * r1] - movu m5, [r6 + r1 + x] - punpcklwd m6, m4, m5 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m0, m6 - punpckhwd m4, m5 - pmaddwd m4, [r5 + 1 * mmsize] - paddd m1, m4 - - movu m4, [r6 + 2 * r1 + 
x] - punpcklwd m6, m5, m4 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m2, m6 - punpckhwd m5, m4 - pmaddwd m5, [r5 + 1 * mmsize] - paddd m3, m5 - -%ifidn %2, ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 -%elifidn %2, ps - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - - packssdw m0, m1 - packssdw m2, m3 -%else - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 -%ifidn %2, pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP -%else - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP -%endif - packssdw m0, m1 - packssdw m2, m3 - pxor m5, m5 - CLIPW2 m0, m2, m5, [pw_pixel_max] -%endif - - movu [r2 + x], m0 - movu [r2 + r3 + x], m2 -%assign x x+mmsize -%endrep - - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - dec r4d - jnz .loopH - RET -%endmacro - FILTER_VER_CHROMA_W16_64xN_avx2 16, ss, 7 - FILTER_VER_CHROMA_W16_64xN_avx2 32, ss, 7 - FILTER_VER_CHROMA_W16_64xN_avx2 48, ss, 7 - FILTER_VER_CHROMA_W16_64xN_avx2 64, ss, 7 - FILTER_VER_CHROMA_W16_64xN_avx2 16, sp, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 32, sp, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 48, sp, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 64, sp, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 16, ps, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 32, ps, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 48, ps, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 64, ps, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 16, pp, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 32, pp, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 48, pp, 8 - FILTER_VER_CHROMA_W16_64xN_avx2 64, pp, 8 - -;----------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) 
-;----------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W16_12xN_avx2 3 -INIT_YMM avx2 -cglobal interp_4tap_vert_%2_12x%1, 5, 8, %3 - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - mov r4d, %1/2 - -%ifidn %2, pp - vbroadcasti128 m7, [INTERP_OFFSET_PP] -%elifidn %2, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%elifidn %2, ps - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - -.loopH: - PROCESS_CHROMA_VERT_W16_2R -%ifidn %2, ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 -%elifidn %2, ps - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - - packssdw m0, m1 - packssdw m2, m3 -%else - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - %ifidn %2, pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP -%else - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP -%endif - packssdw m0, m1 - packssdw m2, m3 - pxor m5, m5 - CLIPW2 m0, m2, m5, [pw_pixel_max] -%endif - - movu [r2], xm0 - movu [r2 + r3], xm2 - vextracti128 xm0, m0, 1 - vextracti128 xm2, m2, 1 - movq [r2 + 16], xm0 - movq [r2 + r3 + 16], xm2 - lea r2, [r2 + 2 * r3] - dec r4d - jnz .loopH - RET -%endmacro - FILTER_VER_CHROMA_W16_12xN_avx2 16, ss, 7 - FILTER_VER_CHROMA_W16_12xN_avx2 16, sp, 8 - FILTER_VER_CHROMA_W16_12xN_avx2 16, ps, 8 - FILTER_VER_CHROMA_W16_12xN_avx2 16, pp, 8 - FILTER_VER_CHROMA_W16_12xN_avx2 32, ss, 7 - FILTER_VER_CHROMA_W16_12xN_avx2 32, sp, 8 - FILTER_VER_CHROMA_W16_12xN_avx2 32, ps, 8 - FILTER_VER_CHROMA_W16_12xN_avx2 32, pp, 8 - -%macro PROCESS_CHROMA_VERT_W24_2R 0 - movu m1, [r0] - movu m3, 
[r0 + r1] - punpcklwd m0, m1, m3 - pmaddwd m0, [r5 + 0 * mmsize] - punpckhwd m1, m3 - pmaddwd m1, [r5 + 0 * mmsize] - - movu xm9, [r0 + mmsize] - movu xm11, [r0 + r1 + mmsize] - punpcklwd xm8, xm9, xm11 - pmaddwd xm8, [r5 + 0 * mmsize] - punpckhwd xm9, xm11 - pmaddwd xm9, [r5 + 0 * mmsize] - - movu m4, [r0 + 2 * r1] - punpcklwd m2, m3, m4 - pmaddwd m2, [r5 + 0 * mmsize] - punpckhwd m3, m4 - pmaddwd m3, [r5 + 0 * mmsize] - - movu xm12, [r0 + 2 * r1 + mmsize] - punpcklwd xm10, xm11, xm12 - pmaddwd xm10, [r5 + 0 * mmsize] - punpckhwd xm11, xm12 - pmaddwd xm11, [r5 + 0 * mmsize] - - lea r6, [r0 + 2 * r1] - movu m5, [r6 + r1] - punpcklwd m6, m4, m5 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m0, m6 - punpckhwd m4, m5 - pmaddwd m4, [r5 + 1 * mmsize] - paddd m1, m4 - - movu xm13, [r6 + r1 + mmsize] - punpcklwd xm14, xm12, xm13 - pmaddwd xm14, [r5 + 1 * mmsize] - paddd xm8, xm14 - punpckhwd xm12, xm13 - pmaddwd xm12, [r5 + 1 * mmsize] - paddd xm9, xm12 - - movu m4, [r6 + 2 * r1] - punpcklwd m6, m5, m4 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m2, m6 - punpckhwd m5, m4 - pmaddwd m5, [r5 + 1 * mmsize] - paddd m3, m5 - - movu xm12, [r6 + 2 * r1 + mmsize] - punpcklwd xm14, xm13, xm12 - pmaddwd xm14, [r5 + 1 * mmsize] - paddd xm10, xm14 - punpckhwd xm13, xm12 - pmaddwd xm13, [r5 + 1 * mmsize] - paddd xm11, xm13 -%endmacro - -;----------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W16_24xN_avx2 3 -INIT_YMM avx2 -%if ARCH_X86_64 -cglobal interp_4tap_vert_%2_24x%1, 5, 7, %3 - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - mov r4d, %1/2 - -%ifidn %2, pp - 
vbroadcasti128 m7, [INTERP_OFFSET_PP] -%elifidn %2, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%elifidn %2, ps - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - -.loopH: - PROCESS_CHROMA_VERT_W24_2R -%ifidn %2, ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - psrad m8, 6 - psrad m9, 6 - psrad m10, 6 - psrad m11, 6 - - packssdw m0, m1 - packssdw m2, m3 - packssdw m8, m9 - packssdw m10, m11 -%elifidn %2, ps - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - paddd m8, m7 - paddd m9, m7 - paddd m10, m7 - paddd m11, m7 - psrad m8, INTERP_SHIFT_PS - psrad m9, INTERP_SHIFT_PS - psrad m10, INTERP_SHIFT_PS - psrad m11, INTERP_SHIFT_PS - - packssdw m0, m1 - packssdw m2, m3 - packssdw m8, m9 - packssdw m10, m11 -%else - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - paddd m8, m7 - paddd m9, m7 - paddd m10, m7 - paddd m11, m7 - %ifidn %2, pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP - psrad m8, INTERP_SHIFT_PP - psrad m9, INTERP_SHIFT_PP - psrad m10, INTERP_SHIFT_PP - psrad m11, INTERP_SHIFT_PP -%else - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP - psrad m8, INTERP_SHIFT_SP - psrad m9, INTERP_SHIFT_SP - psrad m10, INTERP_SHIFT_SP - psrad m11, INTERP_SHIFT_SP -%endif - packssdw m0, m1 - packssdw m2, m3 - packssdw m8, m9 - packssdw m10, m11 - pxor m5, m5 - CLIPW2 m0, m2, m5, [pw_pixel_max] - CLIPW2 m8, m10, m5, [pw_pixel_max] -%endif - - movu [r2], m0 - movu [r2 + r3], m2 - movu [r2 + mmsize], xm8 - movu [r2 + r3 + mmsize], xm10 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - dec r4d - jnz .loopH - RET -%endif -%endmacro - FILTER_VER_CHROMA_W16_24xN_avx2 32, ss, 15 - FILTER_VER_CHROMA_W16_24xN_avx2 32, sp, 15 - FILTER_VER_CHROMA_W16_24xN_avx2 32, ps, 15 - FILTER_VER_CHROMA_W16_24xN_avx2 32, pp, 
15 - FILTER_VER_CHROMA_W16_24xN_avx2 64, ss, 15 - FILTER_VER_CHROMA_W16_24xN_avx2 64, sp, 15 - FILTER_VER_CHROMA_W16_24xN_avx2 64, ps, 15 - FILTER_VER_CHROMA_W16_24xN_avx2 64, pp, 15 - - -;----------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_W16_48x64_avx2 2 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_48x64, 5, 7, %2 - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - mov r4d, 32 - -%ifidn %1, pp - vbroadcasti128 m7, [INTERP_OFFSET_PP] -%elifidn %1, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%elifidn %1, ps - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - -.loopH: -%assign x 0 -%rep 3 - movu m1, [r0 + x] - movu m3, [r0 + r1 + x] - movu m5, [r5 + 0 * mmsize] - punpcklwd m0, m1, m3 - pmaddwd m0, m5 - punpckhwd m1, m3 - pmaddwd m1, m5 - - movu m4, [r0 + 2 * r1 + x] - punpcklwd m2, m3, m4 - pmaddwd m2, m5 - punpckhwd m3, m4 - pmaddwd m3, m5 - - lea r6, [r0 + 2 * r1] - movu m5, [r6 + r1 + x] - punpcklwd m6, m4, m5 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m0, m6 - punpckhwd m4, m5 - pmaddwd m4, [r5 + 1 * mmsize] - paddd m1, m4 - - movu m4, [r6 + 2 * r1 + x] - punpcklwd m6, m5, m4 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m2, m6 - punpckhwd m5, m4 - pmaddwd m5, [r5 + 1 * mmsize] - paddd m3, m5 - -%ifidn %1, ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 -%elifidn %1, ps - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - - packssdw m0, m1 - packssdw m2, m3 -%else - paddd m0, m7 - 
paddd m1, m7 - paddd m2, m7 - paddd m3, m7 -%ifidn %1, pp - psrad m0, INTERP_SHIFT_PP - psrad m1, INTERP_SHIFT_PP - psrad m2, INTERP_SHIFT_PP - psrad m3, INTERP_SHIFT_PP -%else - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP -%endif - packssdw m0, m1 - packssdw m2, m3 - pxor m5, m5 - CLIPW2 m0, m2, m5, [pw_pixel_max] -%endif - - movu [r2 + x], m0 - movu [r2 + r3 + x], m2 -%assign x x+mmsize -%endrep - - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - dec r4d - jnz .loopH - RET -%endmacro - - FILTER_VER_CHROMA_W16_48x64_avx2 pp, 8 - FILTER_VER_CHROMA_W16_48x64_avx2 ps, 8 - FILTER_VER_CHROMA_W16_48x64_avx2 ss, 7 - FILTER_VER_CHROMA_W16_48x64_avx2 sp, 8 - -INIT_XMM sse2 -cglobal chroma_p2s, 3, 7, 3 - ; load width and height - mov r3d, r3m - mov r4d, r4m - add r1, r1 - - ; load constant - mova m2, [tab_c_n8192] - -.loopH: - - xor r5d, r5d -.loopW: - lea r6, [r0 + r5 * 2] - - movu m0, [r6] - psllw m0, (14 - BIT_DEPTH) - paddw m0, m2 - - movu m1, [r6 + r1] - psllw m1, (14 - BIT_DEPTH) - paddw m1, m2 - - add r5d, 8 - cmp r5d, r3d - lea r6, [r2 + r5 * 2] - jg .width4 - movu [r6 + FENC_STRIDE / 2 * 0 - 16], m0 - movu [r6 + FENC_STRIDE / 2 * 2 - 16], m1 - je .nextH - jmp .loopW - -.width4: - test r3d, 4 - jz .width2 - test r3d, 2 - movh [r6 + FENC_STRIDE / 2 * 0 - 16], m0 - movh [r6 + FENC_STRIDE / 2 * 2 - 16], m1 - lea r6, [r6 + 8] - pshufd m0, m0, 2 - pshufd m1, m1, 2 - jz .nextH - -.width2: - movd [r6 + FENC_STRIDE / 2 * 0 - 16], m0 - movd [r6 + FENC_STRIDE / 2 * 2 - 16], m1 - -.nextH: - lea r0, [r0 + r1 * 2] - add r2, FENC_STRIDE / 2 * 4 - - sub r4d, 2 - jnz .loopH - RET - -;----------------------------------------------------------------------------- -; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) -;----------------------------------------------------------------------------- -INIT_YMM avx2 -cglobal filterPixelToShort_48x64, 3, 7, 4 - add r1d, r1d - mov r3d, r3m 
- add r3d, r3d - lea r4, [r3 * 3] - lea r5, [r1 * 3] - - ; load height - mov r6d, 16 - - ; load constant - mova m3, [pw_2000] - -.loop: - movu m0, [r0] - movu m1, [r0 + 32] - movu m2, [r0 + 64] - psllw m0, (14 - BIT_DEPTH) - psllw m1, (14 - BIT_DEPTH) - psllw m2, (14 - BIT_DEPTH) - psubw m0, m3 - psubw m1, m3 - psubw m2, m3 - movu [r2 + r3 * 0], m0 - movu [r2 + r3 * 0 + 32], m1 - movu [r2 + r3 * 0 + 64], m2 - - movu m0, [r0 + r1] - movu m1, [r0 + r1 + 32] - movu m2, [r0 + r1 + 64] - psllw m0, (14 - BIT_DEPTH) - psllw m1, (14 - BIT_DEPTH) - psllw m2, (14 - BIT_DEPTH) - psubw m0, m3 - psubw m1, m3 - psubw m2, m3 - movu [r2 + r3 * 1], m0 - movu [r2 + r3 * 1 + 32], m1 - movu [r2 + r3 * 1 + 64], m2 - - movu m0, [r0 + r1 * 2] - movu m1, [r0 + r1 * 2 + 32] - movu m2, [r0 + r1 * 2 + 64] - psllw m0, (14 - BIT_DEPTH) - psllw m1, (14 - BIT_DEPTH) - psllw m2, (14 - BIT_DEPTH) - psubw m0, m3 - psubw m1, m3 - psubw m2, m3 - movu [r2 + r3 * 2], m0 - movu [r2 + r3 * 2 + 32], m1 - movu [r2 + r3 * 2 + 64], m2 - - movu m0, [r0 + r5] - movu m1, [r0 + r5 + 32] - movu m2, [r0 + r5 + 64] - psllw m0, (14 - BIT_DEPTH) - psllw m1, (14 - BIT_DEPTH) - psllw m2, (14 - BIT_DEPTH) - psubw m0, m3 - psubw m1, m3 - psubw m2, m3 - movu [r2 + r4], m0 - movu [r2 + r4 + 32], m1 - movu [r2 + r4 + 64], m2 - - lea r0, [r0 + r1 * 4] - lea r2, [r2 + r3 * 4] - - dec r6d - jnz .loop - RET - - %macro FILTER_VER_CHROMA_AVX2_8xN 2 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_8x%2, 4, 9, 15 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - vbroadcasti128 m14, [pd_32] -%elifidn %1, sp - vbroadcasti128 m14, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m14, [INTERP_OFFSET_PS] -%endif - lea r6, [r3 * 3] - lea r7, [r1 * 4] - mov r8d, %2 / 16 -.loopH: - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, 
xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - paddd m0, m4 - pmaddwd m2, [r5] - - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - paddd m1, m5 - pmaddwd m3, [r5] - - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - paddd m2, m6 - pmaddwd m4, [r5] - - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhwd xm7, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddwd m7, m5, [r5 + 1 * mmsize] - paddd m3, m7 - pmaddwd m5, [r5] - - movu xm7, [r0 + r4] ; m7 = row 7 - punpckhwd xm8, xm6, xm7 - punpcklwd xm6, xm7 - vinserti128 m6, m6, xm8, 1 - pmaddwd m8, m6, [r5 + 1 * mmsize] - paddd m4, m8 - pmaddwd m6, [r5] - - lea r0, [r0 + r1 * 4] - movu xm8, [r0] ; m8 = row 8 - punpckhwd xm9, xm7, xm8 - punpcklwd xm7, xm8 - vinserti128 m7, m7, xm9, 1 - pmaddwd m9, m7, [r5 + 1 * mmsize] - paddd m5, m9 - pmaddwd m7, [r5] - - - movu xm9, [r0 + r1] ; m9 = row 9 - punpckhwd xm10, xm8, xm9 - punpcklwd xm8, xm9 - vinserti128 m8, m8, xm10, 1 - pmaddwd m10, m8, [r5 + 1 * mmsize] - paddd m6, m10 - pmaddwd m8, [r5] - - - movu xm10, [r0 + r1 * 2] ; m10 = row 10 - punpckhwd xm11, xm9, xm10 - punpcklwd xm9, xm10 - vinserti128 m9, m9, xm11, 1 - pmaddwd m11, m9, [r5 + 1 * mmsize] - paddd m7, m11 - pmaddwd m9, [r5] - - movu xm11, [r0 + r4] ; m11 = row 11 - punpckhwd xm12, xm10, xm11 - punpcklwd xm10, xm11 - vinserti128 m10, m10, xm12, 1 - pmaddwd m12, m10, [r5 + 1 * mmsize] - paddd m8, m12 - pmaddwd m10, [r5] - - lea r0, [r0 + r1 * 4] - movu xm12, [r0] ; m12 = row 12 - 
punpckhwd xm13, xm11, xm12 - punpcklwd xm11, xm12 - vinserti128 m11, m11, xm13, 1 - pmaddwd m13, m11, [r5 + 1 * mmsize] - paddd m9, m13 - pmaddwd m11, [r5] - -%ifidn %1,ss - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - psrad m4, 6 - psrad m5, 6 -%else - paddd m0, m14 - paddd m1, m14 - paddd m2, m14 - paddd m3, m14 - paddd m4, m14 - paddd m5, m14 -%ifidn %1,pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - psrad m4, 6 - psrad m5, 6 -%elifidn %1, sp - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP - psrad m2, INTERP_SHIFT_SP - psrad m3, INTERP_SHIFT_SP - psrad m4, INTERP_SHIFT_SP - psrad m5, INTERP_SHIFT_SP -%else - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS - psrad m2, INTERP_SHIFT_PS - psrad m3, INTERP_SHIFT_PS - psrad m4, INTERP_SHIFT_PS - psrad m5, INTERP_SHIFT_PS -%endif -%endif - - packssdw m0, m1 - packssdw m2, m3 - packssdw m4, m5 - vpermq m0, m0, q3120 - vpermq m2, m2, q3120 - vpermq m4, m4, q3120 - pxor m5, m5 - mova m3, [pw_pixel_max] -%ifidn %1,pp - CLIPW m0, m5, m3 - CLIPW m2, m5, m3 - CLIPW m4, m5, m3 -%elifidn %1, sp - CLIPW m0, m5, m3 - CLIPW m2, m5, m3 - CLIPW m4, m5, m3 -%endif - - vextracti128 xm1, m0, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - vextracti128 xm1, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm1 - lea r2, [r2 + r3 * 4] - vextracti128 xm1, m4, 1 - movu [r2], xm4 - movu [r2 + r3], xm1 - - movu xm13, [r0 + r1] ; m13 = row 13 - punpckhwd xm0, xm12, xm13 - punpcklwd xm12, xm13 - vinserti128 m12, m12, xm0, 1 - pmaddwd m0, m12, [r5 + 1 * mmsize] - paddd m10, m0 - pmaddwd m12, [r5] - - movu xm0, [r0 + r1 * 2] ; m0 = row 14 - punpckhwd xm1, xm13, xm0 - punpcklwd xm13, xm0 - vinserti128 m13, m13, xm1, 1 - pmaddwd m1, m13, [r5 + 1 * mmsize] - paddd m11, m1 - pmaddwd m13, [r5] - -%ifidn %1,ss - psrad m6, 6 - psrad m7, 6 -%else - paddd m6, m14 - paddd m7, m14 -%ifidn %1,pp - psrad m6, 6 - psrad m7, 6 -%elifidn %1, sp - psrad m6, INTERP_SHIFT_SP - psrad m7, INTERP_SHIFT_SP -%else - psrad m6, 
INTERP_SHIFT_PS - psrad m7, INTERP_SHIFT_PS -%endif -%endif - - packssdw m6, m7 - vpermq m6, m6, q3120 -%ifidn %1,pp - CLIPW m6, m5, m3 -%elifidn %1, sp - CLIPW m6, m5, m3 -%endif - vextracti128 xm7, m6, 1 - movu [r2 + r3 * 2], xm6 - movu [r2 + r6], xm7 - - movu xm1, [r0 + r4] ; m1 = row 15 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m2, m0, [r5 + 1 * mmsize] - paddd m12, m2 - pmaddwd m0, [r5] - - lea r0, [r0 + r1 * 4] - movu xm2, [r0] ; m2 = row 16 - punpckhwd xm6, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm6, 1 - pmaddwd m6, m1, [r5 + 1 * mmsize] - paddd m13, m6 - pmaddwd m1, [r5] - - movu xm6, [r0 + r1] ; m6 = row 17 - punpckhwd xm4, xm2, xm6 - punpcklwd xm2, xm6 - vinserti128 m2, m2, xm4, 1 - pmaddwd m2, [r5 + 1 * mmsize] - paddd m0, m2 - - movu xm4, [r0 + r1 * 2] ; m4 = row 18 - punpckhwd xm2, xm6, xm4 - punpcklwd xm6, xm4 - vinserti128 m6, m6, xm2, 1 - pmaddwd m6, [r5 + 1 * mmsize] - paddd m1, m6 - -%ifidn %1,ss - psrad m8, 6 - psrad m9, 6 - psrad m10, 6 - psrad m11, 6 - psrad m12, 6 - psrad m13, 6 - psrad m0, 6 - psrad m1, 6 -%else - paddd m8, m14 - paddd m9, m14 - paddd m10, m14 - paddd m11, m14 - paddd m12, m14 - paddd m13, m14 - paddd m0, m14 - paddd m1, m14 -%ifidn %1,pp - psrad m8, 6 - psrad m9, 6 - psrad m10, 6 - psrad m11, 6 - psrad m12, 6 - psrad m13, 6 - psrad m0, 6 - psrad m1, 6 -%elifidn %1, sp - psrad m8, INTERP_SHIFT_SP - psrad m9, INTERP_SHIFT_SP - psrad m10, INTERP_SHIFT_SP - psrad m11, INTERP_SHIFT_SP - psrad m12, INTERP_SHIFT_SP - psrad m13, INTERP_SHIFT_SP - psrad m0, INTERP_SHIFT_SP - psrad m1, INTERP_SHIFT_SP -%else - psrad m8, INTERP_SHIFT_PS - psrad m9, INTERP_SHIFT_PS - psrad m10, INTERP_SHIFT_PS - psrad m11, INTERP_SHIFT_PS - psrad m12, INTERP_SHIFT_PS - psrad m13, INTERP_SHIFT_PS - psrad m0, INTERP_SHIFT_PS - psrad m1, INTERP_SHIFT_PS -%endif -%endif - - packssdw m8, m9 - packssdw m10, m11 - packssdw m12, m13 - packssdw m0, m1 - vpermq m8, m8, q3120 - vpermq m10, m10, q3120 - 
vpermq m12, m12, q3120 - vpermq m0, m0, q3120 -%ifidn %1,pp - CLIPW m8, m5, m3 - CLIPW m10, m5, m3 - CLIPW m12, m5, m3 - CLIPW m0, m5, m3 -%elifidn %1, sp - CLIPW m8, m5, m3 - CLIPW m10, m5, m3 - CLIPW m12, m5, m3 - CLIPW m0, m5, m3 -%endif - vextracti128 xm9, m8, 1 - vextracti128 xm11, m10, 1 - vextracti128 xm13, m12, 1 - vextracti128 xm1, m0, 1 - lea r2, [r2 + r3 * 4] - movu [r2], xm8 - movu [r2 + r3], xm9 - movu [r2 + r3 * 2], xm10 - movu [r2 + r6], xm11 - lea r2, [r2 + r3 * 4] - movu [r2], xm12 - movu [r2 + r3], xm13 - movu [r2 + r3 * 2], xm0 - movu [r2 + r6], xm1 - lea r2, [r2 + r3 * 4] - dec r8d - jnz .loopH - RET -%endif -%endmacro - -FILTER_VER_CHROMA_AVX2_8xN pp, 16 -FILTER_VER_CHROMA_AVX2_8xN ps, 16 -FILTER_VER_CHROMA_AVX2_8xN ss, 16 -FILTER_VER_CHROMA_AVX2_8xN sp, 16 -FILTER_VER_CHROMA_AVX2_8xN pp, 32 -FILTER_VER_CHROMA_AVX2_8xN ps, 32 -FILTER_VER_CHROMA_AVX2_8xN sp, 32 -FILTER_VER_CHROMA_AVX2_8xN ss, 32 -FILTER_VER_CHROMA_AVX2_8xN pp, 64 -FILTER_VER_CHROMA_AVX2_8xN ps, 64 -FILTER_VER_CHROMA_AVX2_8xN sp, 64 -FILTER_VER_CHROMA_AVX2_8xN ss, 64 - -%macro PROCESS_CHROMA_AVX2_8x2 3 - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m2, m2, [r5 + 1 * mmsize] - paddd m0, m2 - - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m3, m3, [r5 + 1 * mmsize] - paddd m1, m3 - -%ifnidn %1,ss - paddd m0, m7 - paddd m1, m7 -%endif - psrad m0, %3 - psrad m1, %3 - - packssdw m0, m1 - vpermq m0, m0, q3120 - pxor m4, m4 - -%if %2 - CLIPW m0, m4, [pw_pixel_max] -%endif - vextracti128 xm1, m0, 1 -%endmacro - - -%macro 
FILTER_VER_CHROMA_AVX2_8x2 3 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x2, 4, 6, 8 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - vbroadcasti128 m7, [pd_32] -%elifidn %1, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - - PROCESS_CHROMA_AVX2_8x2 %1, %2, %3 - movu [r2], xm0 - movu [r2 + r3], xm1 - RET -%endmacro - -FILTER_VER_CHROMA_AVX2_8x2 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x2 ps, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_8x2 sp, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_8x2 ss, 0, 6 - -%macro FILTER_VER_CHROMA_AVX2_4x2 3 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x2, 4, 6, 7 - mov r4d, r4m - add r1d, r1d - add r3d, r3d - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - -%ifidn %1,pp - vbroadcasti128 m6, [pd_32] -%elifidn %1, sp - vbroadcasti128 m6, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m6, [INTERP_OFFSET_PS] -%endif - - movq xm0, [r0] ; row 0 - movq xm1, [r0 + r1] ; row 1 - punpcklwd xm0, xm1 - - movq xm2, [r0 + r1 * 2] ; row 2 - punpcklwd xm1, xm2 - vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] - pmaddwd m0, [r5] - - movq xm3, [r0 + r4] ; row 3 - punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movq xm4, [r0] ; row 4 - punpcklwd xm3, xm4 - vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] - pmaddwd m5, m2, [r5 + 1 * mmsize] - paddd m0, m5 - -%ifnidn %1, ss - paddd m0, m6 -%endif - psrad m0, %3 - packssdw m0, m0 - pxor m1, m1 - -%if %2 - CLIPW m0, m1, [pw_pixel_max] -%endif - - vextracti128 xm2, m0, 1 - lea r4, [r3 * 3] - movq [r2], xm0 - movq [r2 + r3], xm2 - RET -%endmacro - -FILTER_VER_CHROMA_AVX2_4x2 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_4x2 ps, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_4x2 sp, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_4x2 
ss, 0, 6 - -%macro FILTER_VER_CHROMA_AVX2_4x4 3 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x4, 4, 6, 7 - mov r4d, r4m - add r1d, r1d - add r3d, r3d - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - -%ifidn %1,pp - vbroadcasti128 m6, [pd_32] -%elifidn %1, sp - vbroadcasti128 m6, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m6, [INTERP_OFFSET_PS] -%endif - movq xm0, [r0] ; row 0 - movq xm1, [r0 + r1] ; row 1 - punpcklwd xm0, xm1 - - movq xm2, [r0 + r1 * 2] ; row 2 - punpcklwd xm1, xm2 - vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] - pmaddwd m0, [r5] - - movq xm3, [r0 + r4] ; row 3 - punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movq xm4, [r0] ; row 4 - punpcklwd xm3, xm4 - vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] - pmaddwd m5, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m5 - - movq xm3, [r0 + r1] ; row 5 - punpcklwd xm4, xm3 - movq xm1, [r0 + r1 * 2] ; row 6 - punpcklwd xm3, xm1 - vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] - pmaddwd m4, [r5 + 1 * mmsize] - paddd m2, m4 - -%ifnidn %1,ss - paddd m0, m6 - paddd m2, m6 -%endif - psrad m0, %3 - psrad m2, %3 - - packssdw m0, m2 - pxor m1, m1 -%if %2 - CLIPW m0, m1, [pw_pixel_max] -%endif - - vextracti128 xm2, m0, 1 - lea r4, [r3 * 3] - movq [r2], xm0 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r4], xm2 - RET -%endmacro - -FILTER_VER_CHROMA_AVX2_4x4 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_4x4 ps, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_4x4 sp, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_4x4 ss, 0, 6 - - -%macro FILTER_VER_CHROMA_AVX2_4x8 3 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x8, 4, 7, 8 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - -%ifidn %1,pp - vbroadcasti128 m7, [pd_32] -%elifidn %1, sp - vbroadcasti128 m7, 
[INTERP_OFFSET_SP] -%else - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - lea r6, [r3 * 3] - - movq xm0, [r0] ; row 0 - movq xm1, [r0 + r1] ; row 1 - punpcklwd xm0, xm1 - movq xm2, [r0 + r1 * 2] ; row 2 - punpcklwd xm1, xm2 - vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] - pmaddwd m0, [r5] - - movq xm3, [r0 + r4] ; row 3 - punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movq xm4, [r0] ; row 4 - punpcklwd xm3, xm4 - vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] - pmaddwd m5, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m5 - - movq xm3, [r0 + r1] ; row 5 - punpcklwd xm4, xm3 - movq xm1, [r0 + r1 * 2] ; row 6 - punpcklwd xm3, xm1 - vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] - pmaddwd m5, m4, [r5 + 1 * mmsize] - paddd m2, m5 - pmaddwd m4, [r5] - - movq xm3, [r0 + r4] ; row 7 - punpcklwd xm1, xm3 - lea r0, [r0 + 4 * r1] - movq xm6, [r0] ; row 8 - punpcklwd xm3, xm6 - vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] - pmaddwd m5, m1, [r5 + 1 * mmsize] - paddd m4, m5 - pmaddwd m1, [r5] - - movq xm3, [r0 + r1] ; row 9 - punpcklwd xm6, xm3 - movq xm5, [r0 + 2 * r1] ; row 10 - punpcklwd xm3, xm5 - vinserti128 m6, m6, xm3, 1 ; m6 = [A 9 9 8] - pmaddwd m6, [r5 + 1 * mmsize] - paddd m1, m6 -%ifnidn %1,ss - paddd m0, m7 - paddd m2, m7 -%endif - psrad m0, %3 - psrad m2, %3 - packssdw m0, m2 - pxor m6, m6 - mova m3, [pw_pixel_max] -%if %2 - CLIPW m0, m6, m3 -%endif - vextracti128 xm2, m0, 1 - movq [r2], xm0 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r6], xm2 -%ifnidn %1,ss - paddd m4, m7 - paddd m1, m7 -%endif - psrad m4, %3 - psrad m1, %3 - packssdw m4, m1 -%if %2 - CLIPW m4, m6, m3 -%endif - vextracti128 xm1, m4, 1 - lea r2, [r2 + r3 * 4] - movq [r2], xm4 - movq [r2 + r3], xm1 - movhps [r2 + r3 * 2], xm4 - movhps [r2 + r6], xm1 - RET -%endmacro - -FILTER_VER_CHROMA_AVX2_4x8 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_4x8 ps, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_4x8 sp, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_4x8 ss, 0 , 6 - -%macro 
PROCESS_LUMA_AVX2_W4_16R_4TAP 3 - movq xm0, [r0] ; row 0 - movq xm1, [r0 + r1] ; row 1 - punpcklwd xm0, xm1 - movq xm2, [r0 + r1 * 2] ; row 2 - punpcklwd xm1, xm2 - vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] - pmaddwd m0, [r5] - movq xm3, [r0 + r4] ; row 3 - punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movq xm4, [r0] ; row 4 - punpcklwd xm3, xm4 - vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] - pmaddwd m5, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m5 - movq xm3, [r0 + r1] ; row 5 - punpcklwd xm4, xm3 - movq xm1, [r0 + r1 * 2] ; row 6 - punpcklwd xm3, xm1 - vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] - pmaddwd m5, m4, [r5 + 1 * mmsize] - paddd m2, m5 - pmaddwd m4, [r5] - movq xm3, [r0 + r4] ; row 7 - punpcklwd xm1, xm3 - lea r0, [r0 + 4 * r1] - movq xm6, [r0] ; row 8 - punpcklwd xm3, xm6 - vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] - pmaddwd m5, m1, [r5 + 1 * mmsize] - paddd m4, m5 - pmaddwd m1, [r5] - movq xm3, [r0 + r1] ; row 9 - punpcklwd xm6, xm3 - movq xm5, [r0 + 2 * r1] ; row 10 - punpcklwd xm3, xm5 - vinserti128 m6, m6, xm3, 1 ; m6 = [10 9 9 8] - pmaddwd m3, m6, [r5 + 1 * mmsize] - paddd m1, m3 - pmaddwd m6, [r5] -%ifnidn %1,ss - paddd m0, m7 - paddd m2, m7 -%endif - psrad m0, %3 - psrad m2, %3 - packssdw m0, m2 - pxor m3, m3 -%if %2 - CLIPW m0, m3, [pw_pixel_max] -%endif - vextracti128 xm2, m0, 1 - movq [r2], xm0 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r6], xm2 - movq xm2, [r0 + r4] ;row 11 - punpcklwd xm5, xm2 - lea r0, [r0 + 4 * r1] - movq xm0, [r0] ; row 12 - punpcklwd xm2, xm0 - vinserti128 m5, m5, xm2, 1 ; m5 = [12 11 11 10] - pmaddwd m2, m5, [r5 + 1 * mmsize] - paddd m6, m2 - pmaddwd m5, [r5] - movq xm2, [r0 + r1] ; row 13 - punpcklwd xm0, xm2 - movq xm3, [r0 + 2 * r1] ; row 14 - punpcklwd xm2, xm3 - vinserti128 m0, m0, xm2, 1 ; m0 = [14 13 13 12] - pmaddwd m2, m0, [r5 + 1 * mmsize] - paddd m5, m2 - pmaddwd m0, [r5] -%ifnidn %1,ss - paddd m4, m7 - paddd m1, m7 -%endif - psrad m4, %3 - psrad m1, %3 - packssdw m4, m1 - 
pxor m2, m2 -%if %2 - CLIPW m4, m2, [pw_pixel_max] -%endif - - vextracti128 xm1, m4, 1 - lea r2, [r2 + r3 * 4] - movq [r2], xm4 - movq [r2 + r3], xm1 - movhps [r2 + r3 * 2], xm4 - movhps [r2 + r6], xm1 - movq xm4, [r0 + r4] ; row 15 - punpcklwd xm3, xm4 - lea r0, [r0 + 4 * r1] - movq xm1, [r0] ; row 16 - punpcklwd xm4, xm1 - vinserti128 m3, m3, xm4, 1 ; m3 = [16 15 15 14] - pmaddwd m4, m3, [r5 + 1 * mmsize] - paddd m0, m4 - pmaddwd m3, [r5] - movq xm4, [r0 + r1] ; row 17 - punpcklwd xm1, xm4 - movq xm2, [r0 + 2 * r1] ; row 18 - punpcklwd xm4, xm2 - vinserti128 m1, m1, xm4, 1 ; m1 = [18 17 17 16] - pmaddwd m1, [r5 + 1 * mmsize] - paddd m3, m1 - -%ifnidn %1,ss - paddd m6, m7 - paddd m5, m7 -%endif - psrad m6, %3 - psrad m5, %3 - packssdw m6, m5 - pxor m1, m1 -%if %2 - CLIPW m6, m1, [pw_pixel_max] -%endif - vextracti128 xm5, m6, 1 - lea r2, [r2 + r3 * 4] - movq [r2], xm6 - movq [r2 + r3], xm5 - movhps [r2 + r3 * 2], xm6 - movhps [r2 + r6], xm5 -%ifnidn %1,ss - paddd m0, m7 - paddd m3, m7 -%endif - psrad m0, %3 - psrad m3, %3 - packssdw m0, m3 -%if %2 - CLIPW m0, m1, [pw_pixel_max] -%endif - vextracti128 xm3, m0, 1 - lea r2, [r2 + r3 * 4] - movq [r2], xm0 - movq [r2 + r3], xm3 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r6], xm3 -%endmacro - -%macro FILTER_VER_CHROMA_AVX2_4xN 4 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x%2, 4, 8, 8 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - mov r7d, %2 / 16 -%ifidn %1,pp - vbroadcasti128 m7, [pd_32] -%elifidn %1, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - lea r6, [r3 * 3] -.loopH: - PROCESS_LUMA_AVX2_W4_16R_4TAP %1, %3, %4 - lea r2, [r2 + r3 * 4] - dec r7d - jnz .loopH - RET -%endmacro - -FILTER_VER_CHROMA_AVX2_4xN pp, 16, 1, 6 -FILTER_VER_CHROMA_AVX2_4xN ps, 16, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_4xN sp, 16, 1, 
INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_4xN ss, 16, 0, 6 -FILTER_VER_CHROMA_AVX2_4xN pp, 32, 1, 6 -FILTER_VER_CHROMA_AVX2_4xN ps, 32, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_4xN sp, 32, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_4xN ss, 32, 0, 6 - -%macro FILTER_VER_CHROMA_AVX2_8x8 3 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_8x8, 4, 6, 12 - mov r4d, r4m - add r1d, r1d - add r3d, r3d - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - -%ifidn %1,pp - vbroadcasti128 m11, [pd_32] -%elifidn %1, sp - vbroadcasti128 m11, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m11, [INTERP_OFFSET_PS] -%endif - - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m4 ; res row0 done(0,1,2,3) - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - pmaddwd m3, [r5] - paddd m1, m5 ;res row1 done(1, 2, 3, 4) - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - pmaddwd m4, [r5] - paddd m2, m6 ;res row2 done(2,3,4,5) - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhwd xm7, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddwd m7, m5, [r5 + 1 * mmsize] - pmaddwd m5, [r5] - paddd m3, m7 ;res row3 done(3,4,5,6) - movu xm7, [r0 + r4] ; m7 = row 7 - punpckhwd xm8, xm6, xm7 - punpcklwd xm6, xm7 - vinserti128 m6, m6, xm8, 1 - pmaddwd m8, 
m6, [r5 + 1 * mmsize] - pmaddwd m6, [r5] - paddd m4, m8 ;res row4 done(4,5,6,7) - lea r0, [r0 + r1 * 4] - movu xm8, [r0] ; m8 = row 8 - punpckhwd xm9, xm7, xm8 - punpcklwd xm7, xm8 - vinserti128 m7, m7, xm9, 1 - pmaddwd m9, m7, [r5 + 1 * mmsize] - pmaddwd m7, [r5] - paddd m5, m9 ;res row5 done(5,6,7,8) - movu xm9, [r0 + r1] ; m9 = row 9 - punpckhwd xm10, xm8, xm9 - punpcklwd xm8, xm9 - vinserti128 m8, m8, xm10, 1 - pmaddwd m8, [r5 + 1 * mmsize] - paddd m6, m8 ;res row6 done(6,7,8,9) - movu xm10, [r0 + r1 * 2] ; m10 = row 10 - punpckhwd xm8, xm9, xm10 - punpcklwd xm9, xm10 - vinserti128 m9, m9, xm8, 1 - pmaddwd m9, [r5 + 1 * mmsize] - paddd m7, m9 ;res row7 done 7,8,9,10 - lea r4, [r3 * 3] -%ifnidn %1,ss - paddd m0, m11 - paddd m1, m11 - paddd m2, m11 - paddd m3, m11 -%endif - psrad m0, %3 - psrad m1, %3 - psrad m2, %3 - psrad m3, %3 - packssdw m0, m1 - packssdw m2, m3 - vpermq m0, m0, q3120 - vpermq m2, m2, q3120 - pxor m1, m1 - mova m3, [pw_pixel_max] -%if %2 - CLIPW m0, m1, m3 - CLIPW m2, m1, m3 -%endif - vextracti128 xm9, m0, 1 - vextracti128 xm8, m2, 1 - movu [r2], xm0 - movu [r2 + r3], xm9 - movu [r2 + r3 * 2], xm2 - movu [r2 + r4], xm8 -%ifnidn %1,ss - paddd m4, m11 - paddd m5, m11 - paddd m6, m11 - paddd m7, m11 -%endif - psrad m4, %3 - psrad m5, %3 - psrad m6, %3 - psrad m7, %3 - packssdw m4, m5 - packssdw m6, m7 - vpermq m4, m4, q3120 - vpermq m6, m6, q3120 -%if %2 - CLIPW m4, m1, m3 - CLIPW m6, m1, m3 -%endif - vextracti128 xm5, m4, 1 - vextracti128 xm7, m6, 1 - lea r2, [r2 + r3 * 4] - movu [r2], xm4 - movu [r2 + r3], xm5 - movu [r2 + r3 * 2], xm6 - movu [r2 + r4], xm7 - RET -%endif -%endmacro - -FILTER_VER_CHROMA_AVX2_8x8 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x8 ps, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_8x8 sp, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_8x8 ss, 0, 6 - -%macro FILTER_VER_CHROMA_AVX2_8x6 3 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_8x6, 4, 6, 12 - mov r4d, r4m - add r1d, r1d - add r3d, r3d - shl r4d, 6 - -%ifdef PIC 
- lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - -%ifidn %1,pp - vbroadcasti128 m11, [pd_32] -%elifidn %1, sp - vbroadcasti128 m11, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m11, [INTERP_OFFSET_PS] -%endif - - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m4 ; r0 done(0,1,2,3) - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - pmaddwd m3, [r5] - paddd m1, m5 ;r1 done(1, 2, 3, 4) - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - pmaddwd m4, [r5] - paddd m2, m6 ;r2 done(2,3,4,5) - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhwd xm7, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddwd m7, m5, [r5 + 1 * mmsize] - pmaddwd m5, [r5] - paddd m3, m7 ;r3 done(3,4,5,6) - movu xm7, [r0 + r4] ; m7 = row 7 - punpckhwd xm8, xm6, xm7 - punpcklwd xm6, xm7 - vinserti128 m6, m6, xm8, 1 - pmaddwd m8, m6, [r5 + 1 * mmsize] - paddd m4, m8 ;r4 done(4,5,6,7) - lea r0, [r0 + r1 * 4] - movu xm8, [r0] ; m8 = row 8 - punpckhwd xm9, xm7, xm8 - punpcklwd xm7, xm8 - vinserti128 m7, m7, xm9, 1 - pmaddwd m7, m7, [r5 + 1 * mmsize] - paddd m5, m7 ;r5 done(5,6,7,8) - lea r4, [r3 * 3] -%ifnidn %1,ss - paddd m0, m11 - paddd m1, m11 - paddd m2, m11 - paddd m3, m11 -%endif - psrad m0, %3 - psrad m1, %3 - psrad m2, %3 - psrad m3, %3 - packssdw m0, m1 - packssdw m2, m3 - vpermq 
m0, m0, q3120 - vpermq m2, m2, q3120 - pxor m10, m10 - mova m9, [pw_pixel_max] -%if %2 - CLIPW m0, m10, m9 - CLIPW m2, m10, m9 -%endif - vextracti128 xm1, m0, 1 - vextracti128 xm3, m2, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r4], xm3 -%ifnidn %1,ss - paddd m4, m11 - paddd m5, m11 -%endif - psrad m4, %3 - psrad m5, %3 - packssdw m4, m5 - vpermq m4, m4, 11011000b -%if %2 - CLIPW m4, m10, m9 -%endif - vextracti128 xm5, m4, 1 - lea r2, [r2 + r3 * 4] - movu [r2], xm4 - movu [r2 + r3], xm5 - RET -%endif -%endmacro - -FILTER_VER_CHROMA_AVX2_8x6 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x6 ps, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_8x6 sp, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_8x6 ss, 0, 6 - -%macro PROCESS_CHROMA_AVX2 3 - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - paddd m0, m4 - pmaddwd m2, [r5] - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - paddd m1, m5 - pmaddwd m3, [r5] - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m4, [r5 + 1 * mmsize] - paddd m2, m4 - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhwd xm4, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm4, 1 - pmaddwd m5, [r5 + 1 * mmsize] - paddd m3, m5 -%ifnidn %1,ss - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 -%endif - psrad m0, %3 - psrad m1, %3 - psrad m2, %3 - psrad m3, %3 - packssdw m0, m1 - packssdw m2, m3 - vpermq m0, m0, q3120 - vpermq m2, m2, q3120 - pxor m4, m4 
-%if %2 - CLIPW m0, m4, [pw_pixel_max] - CLIPW m2, m4, [pw_pixel_max] -%endif - vextracti128 xm1, m0, 1 - vextracti128 xm3, m2, 1 -%endmacro - - -%macro FILTER_VER_CHROMA_AVX2_8x4 3 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x4, 4, 6, 8 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - add r3d, r3d -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - vbroadcasti128 m7, [pd_32] -%elifidn %1, sp - vbroadcasti128 m7, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m7, [INTERP_OFFSET_PS] -%endif - PROCESS_CHROMA_AVX2 %1, %2, %3 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm2 - lea r4, [r3 * 3] - movu [r2 + r4], xm3 - RET -%endmacro - -FILTER_VER_CHROMA_AVX2_8x4 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x4 ps, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_8x4 sp, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_8x4 ss, 0, 6 - -%macro FILTER_VER_CHROMA_AVX2_8x12 3 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_8x12, 4, 7, 15 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - vbroadcasti128 m14, [pd_32] -%elifidn %1, sp - vbroadcasti128 m14, [INTERP_OFFSET_SP] -%else - vbroadcasti128 m14, [INTERP_OFFSET_PS] -%endif - lea r6, [r3 * 3] - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - paddd m0, m4 - pmaddwd m2, [r5] - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 
- vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - paddd m1, m5 - pmaddwd m3, [r5] - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - paddd m2, m6 - pmaddwd m4, [r5] - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhwd xm7, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddwd m7, m5, [r5 + 1 * mmsize] - paddd m3, m7 - pmaddwd m5, [r5] - movu xm7, [r0 + r4] ; m7 = row 7 - punpckhwd xm8, xm6, xm7 - punpcklwd xm6, xm7 - vinserti128 m6, m6, xm8, 1 - pmaddwd m8, m6, [r5 + 1 * mmsize] - paddd m4, m8 - pmaddwd m6, [r5] - lea r0, [r0 + r1 * 4] - movu xm8, [r0] ; m8 = row 8 - punpckhwd xm9, xm7, xm8 - punpcklwd xm7, xm8 - vinserti128 m7, m7, xm9, 1 - pmaddwd m9, m7, [r5 + 1 * mmsize] - paddd m5, m9 - pmaddwd m7, [r5] - movu xm9, [r0 + r1] ; m9 = row 9 - punpckhwd xm10, xm8, xm9 - punpcklwd xm8, xm9 - vinserti128 m8, m8, xm10, 1 - pmaddwd m10, m8, [r5 + 1 * mmsize] - paddd m6, m10 - pmaddwd m8, [r5] - movu xm10, [r0 + r1 * 2] ; m10 = row 10 - punpckhwd xm11, xm9, xm10 - punpcklwd xm9, xm10 - vinserti128 m9, m9, xm11, 1 - pmaddwd m11, m9, [r5 + 1 * mmsize] - paddd m7, m11 - pmaddwd m9, [r5] - movu xm11, [r0 + r4] ; m11 = row 11 - punpckhwd xm12, xm10, xm11 - punpcklwd xm10, xm11 - vinserti128 m10, m10, xm12, 1 - pmaddwd m12, m10, [r5 + 1 * mmsize] - paddd m8, m12 - pmaddwd m10, [r5] - lea r0, [r0 + r1 * 4] - movu xm12, [r0] ; m12 = row 12 - punpckhwd xm13, xm11, xm12 - punpcklwd xm11, xm12 - vinserti128 m11, m11, xm13, 1 - pmaddwd m13, m11, [r5 + 1 * mmsize] - paddd m9, m13 - pmaddwd m11, [r5] -%ifnidn %1,ss - paddd m0, m14 - paddd m1, m14 - paddd m2, m14 - paddd m3, m14 - paddd m4, m14 - paddd m5, m14 -%endif - psrad m0, %3 - psrad m1, %3 - psrad m2, %3 - psrad m3, %3 - psrad m4, %3 - psrad m5, %3 - packssdw m0, m1 - packssdw m2, m3 - packssdw m4, m5 - vpermq m0, m0, q3120 - vpermq m2, m2, q3120 - vpermq m4, m4, q3120 - pxor m5, m5 - mova m3, 
[pw_pixel_max] -%if %2 - CLIPW m0, m5, m3 - CLIPW m2, m5, m3 - CLIPW m4, m5, m3 -%endif - vextracti128 xm1, m0, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - vextracti128 xm1, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm1 - lea r2, [r2 + r3 * 4] - vextracti128 xm1, m4, 1 - movu [r2], xm4 - movu [r2 + r3], xm1 - movu xm13, [r0 + r1] ; m13 = row 13 - punpckhwd xm0, xm12, xm13 - punpcklwd xm12, xm13 - vinserti128 m12, m12, xm0, 1 - pmaddwd m12, m12, [r5 + 1 * mmsize] - paddd m10, m12 - movu xm0, [r0 + r1 * 2] ; m0 = row 14 - punpckhwd xm1, xm13, xm0 - punpcklwd xm13, xm0 - vinserti128 m13, m13, xm1, 1 - pmaddwd m13, m13, [r5 + 1 * mmsize] - paddd m11, m13 -%ifnidn %1,ss - paddd m6, m14 - paddd m7, m14 - paddd m8, m14 - paddd m9, m14 - paddd m10, m14 - paddd m11, m14 -%endif - psrad m6, %3 - psrad m7, %3 - psrad m8, %3 - psrad m9, %3 - psrad m10, %3 - psrad m11, %3 - packssdw m6, m7 - packssdw m8, m9 - packssdw m10, m11 - vpermq m6, m6, q3120 - vpermq m8, m8, q3120 - vpermq m10, m10, q3120 -%if %2 - CLIPW m6, m5, m3 - CLIPW m8, m5, m3 - CLIPW m10, m5, m3 -%endif - vextracti128 xm7, m6, 1 - vextracti128 xm9, m8, 1 - vextracti128 xm11, m10, 1 - movu [r2 + r3 * 2], xm6 - movu [r2 + r6], xm7 - lea r2, [r2 + r3 * 4] - movu [r2], xm8 - movu [r2 + r3], xm9 - movu [r2 + r3 * 2], xm10 - movu [r2 + r6], xm11 - RET -%endif -%endmacro - -FILTER_VER_CHROMA_AVX2_8x12 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x12 ps, 0, INTERP_SHIFT_PS -FILTER_VER_CHROMA_AVX2_8x12 sp, 1, INTERP_SHIFT_SP -FILTER_VER_CHROMA_AVX2_8x12 ss, 0, 6 \ No newline at end of file
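For orientation, the 4-tap vertical chroma kernels removed in these files can be read against a scalar reference. The following is only a sketch of the 8-bit pixel-to-pixel (`pp`) path, assuming the rounding offset 32 and shift 6 seen in the `pp` variants (`pw_32` / `pd_32`, `psraw`/`psrad` by 6) and the coefficient rows of `tab_ChromaCoeff`. The function name `interp_4tap_vert_pp_ref`, the helper `clip_u8`, and the explicit `width`/`height` parameters are hypothetical and not part of x265.

```c
#include <stddef.h>
#include <stdint.h>

/* HEVC 4-tap chroma filter coefficients; same rows as tab_ChromaCoeff. */
static const int8_t chroma_coeff[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 },
    { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 },
    { -2, 16, 54, -4 }, { -2, 10, 58, -2 },
};

/* Hypothetical helper: clamp to the 8-bit pixel range
 * (the role packuswb plays in the SSE2 pp code). */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Scalar sketch of the pp vertical filter for 8-bit samples.
 * src must have one valid row above and two valid rows below the block. */
void interp_4tap_vert_pp_ref(const uint8_t *src, ptrdiff_t src_stride,
                             uint8_t *dst, ptrdiff_t dst_stride,
                             int width, int height, int coeff_idx)
{
    const int8_t *c = chroma_coeff[coeff_idx];

    src -= src_stride; /* start one row above, like "sub r0, r1" */

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = c[0] * src[x]
                    + c[1] * src[x + src_stride]
                    + c[2] * src[x + 2 * src_stride]
                    + c[3] * src[x + 3 * src_stride];
            /* Round and scale: add 32, shift right by 6. An arithmetic
             * shift is assumed for negative sums, as on common compilers. */
            dst[x] = clip_u8((sum + 32) >> 6);
        }
        src += src_stride;
        dst += dst_stride;
    }
}
```

Each coefficient row sums to 64, so a flat input passes through unchanged, and `coeff_idx` 0 simply copies the source block. The `ps`, `sp` and `ss` variants differ in offset, shift and clipping, which this sketch does not cover.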
x265_2.7.tar.gz/source/common/x86/v4-ipfilter8.asm
Deleted
@@ -1,12799 +0,0 @@ -;***************************************************************************** -;* Copyright (C) 2013-2017 MulticoreWare, Inc -;* -;* Authors: Min Chen <chenm003@163.com> -;* Nabajit Deka <nabajit@multicorewareinc.com> -;* Praveen Kumar Tiwari <praveen@multicorewareinc.com> -;* -;* This program is free software; you can redistribute it and/or modify -;* it under the terms of the GNU General Public License as published by -;* the Free Software Foundation; either version 2 of the License, or -;* (at your option) any later version. -;* -;* This program is distributed in the hope that it will be useful, -;* but WITHOUT ANY WARRANTY; without even the implied warranty of -;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -;* GNU General Public License for more details. -;* -;* You should have received a copy of the GNU General Public License -;* along with this program; if not, write to the Free Software -;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. -;* -;* This program is also available under a commercial proprietary license. -;* For more information, contact us at license @ x265.com. 
-;*****************************************************************************/ - -%include "x86inc.asm" -%include "x86util.asm" - -SECTION_RODATA 32 - -const v4_pd_526336, times 8 dd 8192*64+2048 - -const tab_Vm, db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1 - db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3 - -const tab_Cm, db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3 - -const interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9 - times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13 - -const v4_interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15 - -const v4_interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4 - dd 2, 3, 3, 4, 4, 5, 5, 6 - -const tab_ChromaCoeff, db 0, 64, 0, 0 - db -2, 58, 10, -2 - db -4, 54, 16, -2 - db -6, 46, 28, -4 - db -4, 36, 36, -4 - db -4, 28, 46, -6 - db -2, 16, 54, -4 - db -2, 10, 58, -2 - -const tabw_ChromaCoeff, dw 0, 64, 0, 0 - dw -2, 58, 10, -2 - dw -4, 54, 16, -2 - dw -6, 46, 28, -4 - dw -4, 36, 36, -4 - dw -4, 28, 46, -6 - dw -2, 16, 54, -4 - dw -2, 10, 58, -2 - -const tab_ChromaCoeffV, times 4 dw 0, 64 - times 4 dw 0, 0 - - times 4 dw -2, 58 - times 4 dw 10, -2 - - times 4 dw -4, 54 - times 4 dw 16, -2 - - times 4 dw -6, 46 - times 4 dw 28, -4 - - times 4 dw -4, 36 - times 4 dw 36, -4 - - times 4 dw -4, 28 - times 4 dw 46, -6 - - times 4 dw -2, 16 - times 4 dw 54, -4 - - times 4 dw -2, 10 - times 4 dw 58, -2 - -const tab_ChromaCoeff_V, times 8 db 0, 64 - times 8 db 0, 0 - - times 8 db -2, 58 - times 8 db 10, -2 - - times 8 db -4, 54 - times 8 db 16, -2 - - times 8 db -6, 46 - times 8 db 28, -4 - - times 8 db -4, 36 - times 8 db 36, -4 - - times 8 db -4, 28 - times 8 db 46, -6 - - times 8 db -2, 16 - times 8 db 54, -4 - - times 8 db -2, 10 - times 8 db 58, -2 - -const tab_ChromaCoeffVer_32, times 16 db 0, 64 - times 16 db 0, 0 - - times 16 db -2, 58 - times 16 db 10, -2 - - times 16 db -4, 54 - times 16 db 16, -2 - - times 16 db -6, 46 - times 16 db 28, -4 - - 
times 16 db -4, 36 - times 16 db 36, -4 - - times 16 db -4, 28 - times 16 db 46, -6 - - times 16 db -2, 16 - times 16 db 54, -4 - - times 16 db -2, 10 - times 16 db 58, -2 - -const pw_ChromaCoeffV, times 8 dw 0, 64 - times 8 dw 0, 0 - - times 8 dw -2, 58 - times 8 dw 10, -2 - - times 8 dw -4, 54 - times 8 dw 16, -2 - - times 8 dw -6, 46 - times 8 dw 28, -4 - - times 8 dw -4, 36 - times 8 dw 36, -4 - - times 8 dw -4, 28 - times 8 dw 46, -6 - - times 8 dw -2, 16 - times 8 dw 54, -4 - - times 8 dw -2, 10 - times 8 dw 58, -2 - -const v4_interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 - -SECTION .text - -cextern pw_32 -cextern pw_512 -cextern pw_2000 - -%macro WORD_TO_DOUBLE 1 -%if ARCH_X86_64 - punpcklbw %1, m8 -%else - punpcklbw %1, %1 - psrlw %1, 8 -%endif -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_2x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W2_H4_sse2 2 -INIT_XMM sse2 -%if ARCH_X86_64 -cglobal interp_4tap_vert_%1_2x%2, 4, 6, 9 - pxor m8, m8 -%else -cglobal interp_4tap_vert_%1_2x%2, 4, 6, 8 -%endif - mov r4d, r4m - sub r0, r1 - -%ifidn %1,pp - mova m1, [pw_32] -%elifidn %1,ps - mova m1, [pw_2000] - add r3d, r3d -%endif - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movh m0, [r5 + r4 * 8] -%else - movh m0, [tabw_ChromaCoeff + r4 * 8] -%endif - - punpcklqdq m0, m0 - lea r5, [3 * r1] - -%assign x 1 -%rep %2/4 - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r5] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklwd m2, m6 - - WORD_TO_DOUBLE m2 - pmaddwd m2, m0 - - lea r0, [r0 + 4 * r1] - movd m6, [r0] - - punpcklbw m3, m4 - punpcklbw m7, m5, m6 - punpcklwd m3, m7 - - WORD_TO_DOUBLE m3 - pmaddwd m3, m0 - - packssdw m2, m3 - pshuflw m3, m2, q2301 - pshufhw m3, m3, q2301 - paddw m2, m3 - - movd m7, [r0 + r1] - - punpcklbw m4, m5 - 
punpcklbw m3, m6, m7 - punpcklwd m4, m3 - - WORD_TO_DOUBLE m4 - pmaddwd m4, m0 - - movd m3, [r0 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m7, m3 - punpcklwd m5, m7 - - WORD_TO_DOUBLE m5 - pmaddwd m5, m0 - - packssdw m4, m5 - pshuflw m5, m4, q2301 - pshufhw m5, m5, q2301 - paddw m4, m5 - -%ifidn %1,pp - psrld m2, 16 - psrld m4, 16 - packssdw m2, m4 - paddw m2, m1 - psraw m2, 6 - packuswb m2, m2 - -%if ARCH_X86_64 - movq r4, m2 - mov [r2], r4w - shr r4, 16 - mov [r2 + r3], r4w - lea r2, [r2 + 2 * r3] - shr r4, 16 - mov [r2], r4w - shr r4, 16 - mov [r2 + r3], r4w -%else - movd r4, m2 - mov [r2], r4w - shr r4, 16 - mov [r2 + r3], r4w - lea r2, [r2 + 2 * r3] - psrldq m2, 4 - movd r4, m2 - mov [r2], r4w - shr r4, 16 - mov [r2 + r3], r4w -%endif -%elifidn %1,ps - psrldq m2, 2 - psrldq m4, 2 - pshufd m2, m2, q3120 - pshufd m4, m4, q3120 - psubw m4, m1 - psubw m2, m1 - - movd [r2], m2 - psrldq m2, 4 - movd [r2 + r3], m2 - lea r2, [r2 + 2 * r3] - movd [r2], m4 - psrldq m4, 4 - movd [r2 + r3], m4 -%endif - -%if x < %2/4 - lea r2, [r2 + 2 * r3] -%endif -%assign x x+1 -%endrep - RET - -%endmacro - - FILTER_V4_W2_H4_sse2 pp, 4 - FILTER_V4_W2_H4_sse2 pp, 8 - FILTER_V4_W2_H4_sse2 pp, 16 - - FILTER_V4_W2_H4_sse2 ps, 4 - FILTER_V4_W2_H4_sse2 ps, 8 - FILTER_V4_W2_H4_sse2 ps, 16 - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V2_W4_H4_sse2 1 -INIT_XMM sse2 -cglobal interp_4tap_vert_%1_4x2, 4, 6, 8 - mov r4d, r4m - sub r0, r1 - pxor m7, m7 - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movh m0, [r5 + r4 * 8] -%else - movh m0, [tabw_ChromaCoeff + r4 * 8] -%endif - - lea r5, [r0 + 2 * r1] - punpcklqdq m0, m0 - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r5] - movd m5, [r5 + r1] - - punpcklbw m2, m3 - punpcklbw m1, m4, m5 - punpcklwd m2, m1 - - movhlps 
m6, m2 - punpcklbw m2, m7 - punpcklbw m6, m7 - pmaddwd m2, m0 - pmaddwd m6, m0 - packssdw m2, m6 - - movd m1, [r0 + 4 * r1] - - punpcklbw m3, m4 - punpcklbw m5, m1 - punpcklwd m3, m5 - - movhlps m6, m3 - punpcklbw m3, m7 - punpcklbw m6, m7 - pmaddwd m3, m0 - pmaddwd m6, m0 - packssdw m3, m6 - - pshuflw m4, m2, q2301 - pshufhw m4, m4, q2301 - paddw m2, m4 - pshuflw m5, m3, q2301 - pshufhw m5, m5, q2301 - paddw m3, m5 - -%ifidn %1, pp - psrld m2, 16 - psrld m3, 16 - packssdw m2, m3 - - paddw m2, [pw_32] - psraw m2, 6 - packuswb m2, m2 - - movd [r2], m2 - psrldq m2, 4 - movd [r2 + r3], m2 -%elifidn %1, ps - psrldq m2, 2 - psrldq m3, 2 - pshufd m2, m2, q3120 - pshufd m3, m3, q3120 - punpcklqdq m2, m3 - - add r3d, r3d - psubw m2, [pw_2000] - movh [r2], m2 - movhps [r2 + r3], m2 -%endif - RET - -%endmacro - - FILTER_V2_W4_H4_sse2 pp - FILTER_V2_W4_H4_sse2 ps - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W4_H4_sse2 2 -INIT_XMM sse2 -%if ARCH_X86_64 -cglobal interp_4tap_vert_%1_4x%2, 4, 6, 9 - pxor m8, m8 -%else -cglobal interp_4tap_vert_%1_4x%2, 4, 6, 8 -%endif - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movh m0, [r5 + r4 * 8] -%else - movh m0, [tabw_ChromaCoeff + r4 * 8] -%endif - -%ifidn %1,pp - mova m1, [pw_32] -%elifidn %1,ps - add r3d, r3d - mova m1, [pw_2000] -%endif - - lea r5, [3 * r1] - lea r4, [3 * r3] - punpcklqdq m0, m0 - -%assign x 1 -%rep %2/4 - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r5] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklwd m2, m6 - - movhlps m6, m2 - WORD_TO_DOUBLE m2 - WORD_TO_DOUBLE m6 - pmaddwd m2, m0 - pmaddwd m6, m0 - packssdw m2, m6 - - lea r0, [r0 + 4 * r1] - movd m6, [r0] - - punpcklbw m3, m4 - punpcklbw m7, m5, m6 - 
punpcklwd m3, m7 - - movhlps m7, m3 - WORD_TO_DOUBLE m3 - WORD_TO_DOUBLE m7 - pmaddwd m3, m0 - pmaddwd m7, m0 - packssdw m3, m7 - - pshuflw m7, m2, q2301 - pshufhw m7, m7, q2301 - paddw m2, m7 - pshuflw m7, m3, q2301 - pshufhw m7, m7, q2301 - paddw m3, m7 - -%ifidn %1,pp - psrld m2, 16 - psrld m3, 16 - packssdw m2, m3 - paddw m2, m1 - psraw m2, 6 -%elifidn %1,ps - psrldq m2, 2 - psrldq m3, 2 - pshufd m2, m2, q3120 - pshufd m3, m3, q3120 - punpcklqdq m2, m3 - - psubw m2, m1 - movh [r2], m2 - movhps [r2 + r3], m2 -%endif - - movd m7, [r0 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m7 - punpcklwd m4, m3 - - movhlps m3, m4 - WORD_TO_DOUBLE m4 - WORD_TO_DOUBLE m3 - pmaddwd m4, m0 - pmaddwd m3, m0 - packssdw m4, m3 - - movd m3, [r0 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m7, m3 - punpcklwd m5, m7 - - movhlps m3, m5 - WORD_TO_DOUBLE m5 - WORD_TO_DOUBLE m3 - pmaddwd m5, m0 - pmaddwd m3, m0 - packssdw m5, m3 - - pshuflw m7, m4, q2301 - pshufhw m7, m7, q2301 - paddw m4, m7 - pshuflw m7, m5, q2301 - pshufhw m7, m7, q2301 - paddw m5, m7 - -%ifidn %1,pp - psrld m4, 16 - psrld m5, 16 - packssdw m4, m5 - - paddw m4, m1 - psraw m4, 6 - packuswb m2, m4 - - movd [r2], m2 - psrldq m2, 4 - movd [r2 + r3], m2 - psrldq m2, 4 - movd [r2 + 2 * r3], m2 - psrldq m2, 4 - movd [r2 + r4], m2 -%elifidn %1,ps - psrldq m4, 2 - psrldq m5, 2 - pshufd m4, m4, q3120 - pshufd m5, m5, q3120 - punpcklqdq m4, m5 - psubw m4, m1 - movh [r2 + 2 * r3], m4 - movhps [r2 + r4], m4 -%endif - -%if x < %2/4 - lea r2, [r2 + 4 * r3] -%endif - -%assign x x+1 -%endrep - RET - -%endmacro - - FILTER_V4_W4_H4_sse2 pp, 4 - FILTER_V4_W4_H4_sse2 pp, 8 - FILTER_V4_W4_H4_sse2 pp, 16 - FILTER_V4_W4_H4_sse2 pp, 32 - - FILTER_V4_W4_H4_sse2 ps, 4 - FILTER_V4_W4_H4_sse2 ps, 8 - FILTER_V4_W4_H4_sse2 ps, 16 - FILTER_V4_W4_H4_sse2 ps, 32 - -;----------------------------------------------------------------------------- -;void interp_4tap_vert_%1_6x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
-;----------------------------------------------------------------------------- -%macro FILTER_V4_W6_H4_sse2 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_%1_6x%2, 4, 7, 10 - mov r4d, r4m - sub r0, r1 - shl r4d, 5 - pxor m9, m9 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - mova m6, [r5 + r4] - mova m5, [r5 + r4 + 16] -%else - mova m6, [tab_ChromaCoeffV + r4] - mova m5, [tab_ChromaCoeffV + r4 + 16] -%endif - -%ifidn %1,pp - mova m4, [pw_32] -%elifidn %1,ps - mova m4, [pw_2000] - add r3d, r3d -%endif - lea r5, [3 * r1] - -%assign x 1 -%rep %2/4 - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - movq m3, [r0 + r5] - - punpcklbw m0, m1 - punpcklbw m1, m2 - punpcklbw m2, m3 - - movhlps m7, m0 - punpcklbw m0, m9 - punpcklbw m7, m9 - pmaddwd m0, m6 - pmaddwd m7, m6 - packssdw m0, m7 - - movhlps m8, m2 - movq m7, m2 - punpcklbw m8, m9 - punpcklbw m7, m9 - pmaddwd m8, m5 - pmaddwd m7, m5 - packssdw m7, m8 - - paddw m0, m7 - -%ifidn %1,pp - paddw m0, m4 - psraw m0, 6 - packuswb m0, m0 - - movd [r2], m0 - pextrw r6d, m0, 2 - mov [r2 + 4], r6w -%elifidn %1,ps - psubw m0, m4 - movh [r2], m0 - pshufd m0, m0, 2 - movd [r2 + 8], m0 -%endif - - lea r0, [r0 + 4 * r1] - - movq m0, [r0] - punpcklbw m3, m0 - - movhlps m8, m1 - punpcklbw m1, m9 - punpcklbw m8, m9 - pmaddwd m1, m6 - pmaddwd m8, m6 - packssdw m1, m8 - - movhlps m8, m3 - movq m7, m3 - punpcklbw m8, m9 - punpcklbw m7, m9 - pmaddwd m8, m5 - pmaddwd m7, m5 - packssdw m7, m8 - - paddw m1, m7 - -%ifidn %1,pp - paddw m1, m4 - psraw m1, 6 - packuswb m1, m1 - - movd [r2 + r3], m1 - pextrw r6d, m1, 2 - mov [r2 + r3 + 4], r6w -%elifidn %1,ps - psubw m1, m4 - movh [r2 + r3], m1 - pshufd m1, m1, 2 - movd [r2 + r3 + 8], m1 -%endif - - movq m1, [r0 + r1] - punpcklbw m7, m0, m1 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m6 - pmaddwd m8, m6 - packssdw m2, m8 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m5 - pmaddwd m8, m5 - packssdw m7, m8 - - paddw m2, m7 - lea r2, [r2 + 2 * 
r3] - -%ifidn %1,pp - paddw m2, m4 - psraw m2, 6 - packuswb m2, m2 - movd [r2], m2 - pextrw r6d, m2, 2 - mov [r2 + 4], r6w -%elifidn %1,ps - psubw m2, m4 - movh [r2], m2 - pshufd m2, m2, 2 - movd [r2 + 8], m2 -%endif - - movq m2, [r0 + 2 * r1] - punpcklbw m1, m2 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m6 - pmaddwd m8, m6 - packssdw m3, m8 - - movhlps m8, m1 - punpcklbw m1, m9 - punpcklbw m8, m9 - pmaddwd m1, m5 - pmaddwd m8, m5 - packssdw m1, m8 - - paddw m3, m1 - -%ifidn %1,pp - paddw m3, m4 - psraw m3, 6 - packuswb m3, m3 - - movd [r2 + r3], m3 - pextrw r6d, m3, 2 - mov [r2 + r3 + 4], r6w -%elifidn %1,ps - psubw m3, m4 - movh [r2 + r3], m3 - pshufd m3, m3, 2 - movd [r2 + r3 + 8], m3 -%endif - -%if x < %2/4 - lea r2, [r2 + 2 * r3] -%endif - -%assign x x+1 -%endrep - RET - -%endmacro - -%if ARCH_X86_64 - FILTER_V4_W6_H4_sse2 pp, 8 - FILTER_V4_W6_H4_sse2 pp, 16 - FILTER_V4_W6_H4_sse2 ps, 8 - FILTER_V4_W6_H4_sse2 ps, 16 -%endif - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W8_sse2 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_%1_8x%2, 4, 7, 12 - mov r4d, r4m - sub r0, r1 - shl r4d, 5 - pxor m9, m9 - -%ifidn %1,pp - mova m4, [pw_32] -%elifidn %1,ps - mova m4, [pw_2000] - add r3d, r3d -%endif - -%ifdef PIC - lea r6, [tab_ChromaCoeffV] - mova m6, [r6 + r4] - mova m5, [r6 + r4 + 16] -%else - mova m6, [tab_ChromaCoeffV + r4] - mova m5, [tab_ChromaCoeffV + r4 + 16] -%endif - - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - lea r5, [r0 + 2 * r1] - movq m3, [r5 + r1] - - punpcklbw m0, m1 - punpcklbw m7, m2, m3 - - movhlps m8, m0 - punpcklbw m0, m9 - punpcklbw m8, m9 - pmaddwd m0, m6 - pmaddwd m8, m6 - packssdw m0, m8 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m5 - 
pmaddwd m8, m5 - packssdw m7, m8 - - paddw m0, m7 - -%ifidn %1,pp - paddw m0, m4 - psraw m0, 6 -%elifidn %1,ps - psubw m0, m4 - movu [r2], m0 -%endif - - movq m11, [r0 + 4 * r1] - - punpcklbw m1, m2 - punpcklbw m7, m3, m11 - - movhlps m8, m1 - punpcklbw m1, m9 - punpcklbw m8, m9 - pmaddwd m1, m6 - pmaddwd m8, m6 - packssdw m1, m8 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m5 - pmaddwd m8, m5 - packssdw m7, m8 - - paddw m1, m7 - -%ifidn %1,pp - paddw m1, m4 - psraw m1, 6 - packuswb m1, m0 - - movhps [r2], m1 - movh [r2 + r3], m1 -%elifidn %1,ps - psubw m1, m4 - movu [r2 + r3], m1 -%endif -%if %2 == 2 ;end of 8x2 - RET - -%else - lea r6, [r0 + 4 * r1] - movq m1, [r6 + r1] - - punpcklbw m2, m3 - punpcklbw m7, m11, m1 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m6 - pmaddwd m8, m6 - packssdw m2, m8 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m5 - pmaddwd m8, m5 - packssdw m7, m8 - - paddw m2, m7 - -%ifidn %1,pp - paddw m2, m4 - psraw m2, 6 -%elifidn %1,ps - psubw m2, m4 - movu [r2 + 2 * r3], m2 -%endif - - movq m10, [r6 + 2 * r1] - - punpcklbw m3, m11 - punpcklbw m7, m1, m10 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m6 - pmaddwd m8, m6 - packssdw m3, m8 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m5 - pmaddwd m8, m5 - packssdw m7, m8 - - paddw m3, m7 - lea r5, [r2 + 2 * r3] - -%ifidn %1,pp - paddw m3, m4 - psraw m3, 6 - packuswb m3, m2 - - movhps [r2 + 2 * r3], m3 - movh [r5 + r3], m3 -%elifidn %1,ps - psubw m3, m4 - movu [r5 + r3], m3 -%endif -%if %2 == 4 ;end of 8x4 - RET - -%else - lea r6, [r6 + 2 * r1] - movq m3, [r6 + r1] - - punpcklbw m11, m1 - punpcklbw m7, m10, m3 - - movhlps m8, m11 - punpcklbw m11, m9 - punpcklbw m8, m9 - pmaddwd m11, m6 - pmaddwd m8, m6 - packssdw m11, m8 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m5 - pmaddwd m8, m5 - packssdw m7, m8 - - paddw m11, m7 - -%ifidn %1, pp - paddw 
m11, m4 - psraw m11, 6 -%elifidn %1,ps - psubw m11, m4 - movu [r2 + 4 * r3], m11 -%endif - - movq m7, [r0 + 8 * r1] - - punpcklbw m1, m10 - punpcklbw m3, m7 - - movhlps m8, m1 - punpcklbw m1, m9 - punpcklbw m8, m9 - pmaddwd m1, m6 - pmaddwd m8, m6 - packssdw m1, m8 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m5 - pmaddwd m8, m5 - packssdw m3, m8 - - paddw m1, m3 - lea r5, [r2 + 4 * r3] - -%ifidn %1,pp - paddw m1, m4 - psraw m1, 6 - packuswb m1, m11 - - movhps [r2 + 4 * r3], m1 - movh [r5 + r3], m1 -%elifidn %1,ps - psubw m1, m4 - movu [r5 + r3], m1 -%endif -%if %2 == 6 - RET - -%else - %error INVALID macro argument, only 2, 4 or 6! -%endif -%endif -%endif -%endmacro - -%if ARCH_X86_64 - FILTER_V4_W8_sse2 pp, 2 - FILTER_V4_W8_sse2 pp, 4 - FILTER_V4_W8_sse2 pp, 6 - FILTER_V4_W8_sse2 ps, 2 - FILTER_V4_W8_sse2 ps, 4 - FILTER_V4_W8_sse2 ps, 6 -%endif - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W8_H8_H16_H32_sse2 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_%1_8x%2, 4, 6, 11 - mov r4d, r4m - sub r0, r1 - shl r4d, 5 - pxor m9, m9 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - mova m6, [r5 + r4] - mova m5, [r5 + r4 + 16] -%else - mova m6, [tab_ChromaCoeff + r4] - mova m5, [tab_ChromaCoeff + r4 + 16] -%endif - -%ifidn %1,pp - mova m4, [pw_32] -%elifidn %1,ps - mova m4, [pw_2000] - add r3d, r3d -%endif - - lea r5, [r1 * 3] - -%assign x 1 -%rep %2/4 - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - movq m3, [r0 + r5] - - punpcklbw m0, m1 - punpcklbw m1, m2 - punpcklbw m2, m3 - - movhlps m7, m0 - punpcklbw m0, m9 - punpcklbw m7, m9 - pmaddwd m0, m6 - pmaddwd m7, m6 - packssdw m0, m7 - - movhlps m8, m2 - movq m7, m2 - punpcklbw m8, m9 - punpcklbw m7, m9 - pmaddwd m8, m5 - pmaddwd m7, m5 - packssdw 
m7, m8 - - paddw m0, m7 - -%ifidn %1,pp - paddw m0, m4 - psraw m0, 6 -%elifidn %1,ps - psubw m0, m4 - movu [r2], m0 -%endif - - lea r0, [r0 + 4 * r1] - movq m10, [r0] - punpcklbw m3, m10 - - movhlps m8, m1 - punpcklbw m1, m9 - punpcklbw m8, m9 - pmaddwd m1, m6 - pmaddwd m8, m6 - packssdw m1, m8 - - movhlps m8, m3 - movq m7, m3 - punpcklbw m8, m9 - punpcklbw m7, m9 - pmaddwd m8, m5 - pmaddwd m7, m5 - packssdw m7, m8 - - paddw m1, m7 - -%ifidn %1,pp - paddw m1, m4 - psraw m1, 6 - - packuswb m0, m1 - movh [r2], m0 - movhps [r2 + r3], m0 -%elifidn %1,ps - psubw m1, m4 - movu [r2 + r3], m1 -%endif - - movq m1, [r0 + r1] - punpcklbw m10, m1 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m6 - pmaddwd m8, m6 - packssdw m2, m8 - - movhlps m8, m10 - punpcklbw m10, m9 - punpcklbw m8, m9 - pmaddwd m10, m5 - pmaddwd m8, m5 - packssdw m10, m8 - - paddw m2, m10 - lea r2, [r2 + 2 * r3] - -%ifidn %1,pp - paddw m2, m4 - psraw m2, 6 -%elifidn %1,ps - psubw m2, m4 - movu [r2], m2 -%endif - - movq m7, [r0 + 2 * r1] - punpcklbw m1, m7 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m6 - pmaddwd m8, m6 - packssdw m3, m8 - - movhlps m8, m1 - punpcklbw m1, m9 - punpcklbw m8, m9 - pmaddwd m1, m5 - pmaddwd m8, m5 - packssdw m1, m8 - - paddw m3, m1 - -%ifidn %1,pp - paddw m3, m4 - psraw m3, 6 - - packuswb m2, m3 - movh [r2], m2 - movhps [r2 + r3], m2 -%elifidn %1,ps - psubw m3, m4 - movu [r2 + r3], m3 -%endif - -%if x < %2/4 - lea r2, [r2 + 2 * r3] -%endif -%endrep - RET -%endmacro - -%if ARCH_X86_64 - FILTER_V4_W8_H8_H16_H32_sse2 pp, 8 - FILTER_V4_W8_H8_H16_H32_sse2 pp, 16 - FILTER_V4_W8_H8_H16_H32_sse2 pp, 32 - - FILTER_V4_W8_H8_H16_H32_sse2 pp, 12 - FILTER_V4_W8_H8_H16_H32_sse2 pp, 64 - - FILTER_V4_W8_H8_H16_H32_sse2 ps, 8 - FILTER_V4_W8_H8_H16_H32_sse2 ps, 16 - FILTER_V4_W8_H8_H16_H32_sse2 ps, 32 - - FILTER_V4_W8_H8_H16_H32_sse2 ps, 12 - FILTER_V4_W8_H8_H16_H32_sse2 ps, 64 -%endif - 
-;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W12_H2_sse2 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_%1_12x%2, 4, 6, 11 - mov r4d, r4m - sub r0, r1 - shl r4d, 5 - pxor m9, m9 - -%ifidn %1,pp - mova m6, [pw_32] -%elifidn %1,ps - mova m6, [pw_2000] - add r3d, r3d -%endif - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - mova m1, [r5 + r4] - mova m0, [r5 + r4 + 16] -%else - mova m1, [tab_ChromaCoeffV + r4] - mova m0, [tab_ChromaCoeffV + r4 + 16] -%endif - -%assign x 1 -%rep %2/2 - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m1 - pmaddwd m8, m1 - packssdw m2, m8 - - lea r0, [r0 + 2 * r1] - movu m5, [r0] - movu m7, [r0 + r1] - - punpcklbw m10, m5, m7 - movhlps m8, m10 - punpcklbw m10, m9 - punpcklbw m8, m9 - pmaddwd m10, m0 - pmaddwd m8, m0 - packssdw m10, m8 - - paddw m4, m10 - - punpckhbw m10, m5, m7 - movhlps m8, m10 - punpcklbw m10, m9 - punpcklbw m8, m9 - pmaddwd m10, m0 - pmaddwd m8, m0 - packssdw m10, m8 - - paddw m2, m10 - -%ifidn %1,pp - paddw m4, m6 - psraw m4, 6 - paddw m2, m6 - psraw m2, 6 - - packuswb m4, m2 - movh [r2], m4 - psrldq m4, 8 - movd [r2 + 8], m4 -%elifidn %1,ps - psubw m4, m6 - psubw m2, m6 - movu [r2], m4 - movh [r2 + 16], m2 -%endif - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m4 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m1 - pmaddwd m8, m1 - packssdw m3, m8 - - movu m5, [r0 + 2 * r1] - punpcklbw m2, m7, m5 - punpckhbw m7, m5 - - movhlps m8, m2 - punpcklbw m2, m9 - 
punpcklbw m8, m9 - pmaddwd m2, m0 - pmaddwd m8, m0 - packssdw m2, m8 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m0 - pmaddwd m8, m0 - packssdw m7, m8 - - paddw m4, m2 - paddw m3, m7 - -%ifidn %1,pp - paddw m4, m6 - psraw m4, 6 - paddw m3, m6 - psraw m3, 6 - - packuswb m4, m3 - movh [r2 + r3], m4 - psrldq m4, 8 - movd [r2 + r3 + 8], m4 -%elifidn %1,ps - psubw m4, m6 - psubw m3, m6 - movu [r2 + r3], m4 - movh [r2 + r3 + 16], m3 -%endif - -%if x < %2/2 - lea r2, [r2 + 2 * r3] -%endif -%assign x x+1 -%endrep - RET - -%endmacro - -%if ARCH_X86_64 - FILTER_V4_W12_H2_sse2 pp, 16 - FILTER_V4_W12_H2_sse2 pp, 32 - FILTER_V4_W12_H2_sse2 ps, 16 - FILTER_V4_W12_H2_sse2 ps, 32 -%endif - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W16_H2_sse2 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_%1_16x%2, 4, 6, 11 - mov r4d, r4m - sub r0, r1 - shl r4d, 5 - pxor m9, m9 - -%ifidn %1,pp - mova m6, [pw_32] -%elifidn %1,ps - mova m6, [pw_2000] - add r3d, r3d -%endif - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - mova m1, [r5 + r4] - mova m0, [r5 + r4 + 16] -%else - mova m1, [tab_ChromaCoeffV + r4] - mova m0, [tab_ChromaCoeffV + r4 + 16] -%endif - -%assign x 1 -%rep %2/2 - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m1 - pmaddwd m8, m1 - packssdw m2, m8 - - lea r0, [r0 + 2 * r1] - movu m5, [r0] - movu m10, [r0 + r1] - - punpckhbw m7, m5, m10 - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m0 - pmaddwd m8, m0 - packssdw m7, m8 - paddw m2, m7 - - punpcklbw m7, m5, m10 - movhlps m8, m7 - punpcklbw m7, 
m9 - punpcklbw m8, m9 - pmaddwd m7, m0 - pmaddwd m8, m0 - packssdw m7, m8 - paddw m4, m7 - -%ifidn %1,pp - paddw m4, m6 - psraw m4, 6 - paddw m2, m6 - psraw m2, 6 - - packuswb m4, m2 - movu [r2], m4 -%elifidn %1,ps - psubw m4, m6 - psubw m2, m6 - movu [r2], m4 - movu [r2 + 16], m2 -%endif - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m1 - pmaddwd m8, m1 - packssdw m3, m8 - - movu m5, [r0 + 2 * r1] - - punpcklbw m2, m10, m5 - punpckhbw m10, m5 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m0 - pmaddwd m8, m0 - packssdw m2, m8 - - movhlps m8, m10 - punpcklbw m10, m9 - punpcklbw m8, m9 - pmaddwd m10, m0 - pmaddwd m8, m0 - packssdw m10, m8 - - paddw m4, m2 - paddw m3, m10 - -%ifidn %1,pp - paddw m4, m6 - psraw m4, 6 - paddw m3, m6 - psraw m3, 6 - - packuswb m4, m3 - movu [r2 + r3], m4 -%elifidn %1,ps - psubw m4, m6 - psubw m3, m6 - movu [r2 + r3], m4 - movu [r2 + r3 + 16], m3 -%endif - -%if x < %2/2 - lea r2, [r2 + 2 * r3] -%endif -%assign x x+1 -%endrep - RET - -%endmacro - -%if ARCH_X86_64 - FILTER_V4_W16_H2_sse2 pp, 4 - FILTER_V4_W16_H2_sse2 pp, 8 - FILTER_V4_W16_H2_sse2 pp, 12 - FILTER_V4_W16_H2_sse2 pp, 16 - FILTER_V4_W16_H2_sse2 pp, 32 - - FILTER_V4_W16_H2_sse2 pp, 24 - FILTER_V4_W16_H2_sse2 pp, 64 - - FILTER_V4_W16_H2_sse2 ps, 4 - FILTER_V4_W16_H2_sse2 ps, 8 - FILTER_V4_W16_H2_sse2 ps, 12 - FILTER_V4_W16_H2_sse2 ps, 16 - FILTER_V4_W16_H2_sse2 ps, 32 - - FILTER_V4_W16_H2_sse2 ps, 24 - FILTER_V4_W16_H2_sse2 ps, 64 -%endif - -;----------------------------------------------------------------------------- -;void interp_4tap_vert_%1_24%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W24_sse2 2 -INIT_XMM sse2 -cglobal 
interp_4tap_vert_%1_24x%2, 4, 6, 11 - mov r4d, r4m - sub r0, r1 - shl r4d, 5 - pxor m9, m9 - -%ifidn %1,pp - mova m6, [pw_32] -%elifidn %1,ps - mova m6, [pw_2000] - add r3d, r3d -%endif - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - mova m1, [r5 + r4] - mova m0, [r5 + r4 + 16] -%else - mova m1, [tab_ChromaCoeffV + r4] - mova m0, [tab_ChromaCoeffV + r4 + 16] -%endif - -%assign x 1 -%rep %2/2 - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m1 - pmaddwd m8, m1 - packssdw m2, m8 - - lea r5, [r0 + 2 * r1] - movu m5, [r5] - movu m10, [r5 + r1] - punpcklbw m7, m5, m10 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m0 - pmaddwd m8, m0 - packssdw m7, m8 - paddw m4, m7 - - punpckhbw m7, m5, m10 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m0 - pmaddwd m8, m0 - packssdw m7, m8 - - paddw m2, m7 - -%ifidn %1,pp - paddw m4, m6 - psraw m4, 6 - paddw m2, m6 - psraw m2, 6 - - packuswb m4, m2 - movu [r2], m4 -%elifidn %1,ps - psubw m4, m6 - psubw m2, m6 - movu [r2], m4 - movu [r2 + 16], m2 -%endif - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m1 - pmaddwd m8, m1 - packssdw m3, m8 - - movu m2, [r5 + 2 * r1] - - punpcklbw m5, m10, m2 - punpckhbw m10, m2 - - movhlps m8, m5 - punpcklbw m5, m9 - punpcklbw m8, m9 - pmaddwd m5, m0 - pmaddwd m8, m0 - packssdw m5, m8 - - movhlps m8, m10 - punpcklbw m10, m9 - punpcklbw m8, m9 - pmaddwd m10, m0 - pmaddwd m8, m0 - packssdw m10, m8 - - paddw m4, m5 - paddw m3, m10 - -%ifidn %1,pp - paddw m4, m6 - psraw m4, 6 - paddw m3, m6 - psraw m3, 6 - - packuswb m4, m3 - movu [r2 + r3], m4 -%elifidn %1,ps - psubw m4, m6 - psubw 
m3, m6 - movu [r2 + r3], m4 - movu [r2 + r3 + 16], m3 -%endif - - movq m2, [r0 + 16] - movq m3, [r0 + r1 + 16] - movq m4, [r5 + 16] - movq m5, [r5 + r1 + 16] - - punpcklbw m2, m3 - punpcklbw m4, m5 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m0 - pmaddwd m8, m0 - packssdw m4, m8 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m1 - pmaddwd m8, m1 - packssdw m2, m8 - - paddw m2, m4 - -%ifidn %1,pp - paddw m2, m6 - psraw m2, 6 -%elifidn %1,ps - psubw m2, m6 - movu [r2 + 32], m2 -%endif - - movq m3, [r0 + r1 + 16] - movq m4, [r5 + 16] - movq m5, [r5 + r1 + 16] - movq m7, [r5 + 2 * r1 + 16] - - punpcklbw m3, m4 - punpcklbw m5, m7 - - movhlps m8, m5 - punpcklbw m5, m9 - punpcklbw m8, m9 - pmaddwd m5, m0 - pmaddwd m8, m0 - packssdw m5, m8 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m1 - pmaddwd m8, m1 - packssdw m3, m8 - - paddw m3, m5 - -%ifidn %1,pp - paddw m3, m6 - psraw m3, 6 - - packuswb m2, m3 - movh [r2 + 16], m2 - movhps [r2 + r3 + 16], m2 -%elifidn %1,ps - psubw m3, m6 - movu [r2 + r3 + 32], m3 -%endif - -%if x < %2/2 - mov r0, r5 - lea r2, [r2 + 2 * r3] -%endif -%assign x x+1 -%endrep - RET - -%endmacro - -%if ARCH_X86_64 - FILTER_V4_W24_sse2 pp, 32 - FILTER_V4_W24_sse2 pp, 64 - FILTER_V4_W24_sse2 ps, 32 - FILTER_V4_W24_sse2 ps, 64 -%endif - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W32_sse2 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_%1_32x%2, 4, 6, 10 - mov r4d, r4m - sub r0, r1 - shl r4d, 5 - pxor m9, m9 - -%ifidn %1,pp - mova m6, [pw_32] -%elifidn %1,ps - mova m6, [pw_2000] - add r3d, r3d -%endif - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - mova m1, [r5 + r4] - mova m0, [r5 + r4 + 16] -%else - mova m1, [tab_ChromaCoeffV + r4] - mova m0, 
[tab_ChromaCoeffV + r4 + 16] -%endif - - mov r4d, %2 - -.loop: - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m1 - pmaddwd m8, m1 - packssdw m2, m8 - - lea r5, [r0 + 2 * r1] - movu m3, [r5] - movu m5, [r5 + r1] - - punpcklbw m7, m3, m5 - punpckhbw m3, m5 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m0 - pmaddwd m8, m0 - packssdw m7, m8 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m0 - pmaddwd m8, m0 - packssdw m3, m8 - - paddw m4, m7 - paddw m2, m3 - -%ifidn %1,pp - paddw m4, m6 - psraw m4, 6 - paddw m2, m6 - psraw m2, 6 - - packuswb m4, m2 - movu [r2], m4 -%elifidn %1,ps - psubw m4, m6 - psubw m2, m6 - movu [r2], m4 - movu [r2 + 16], m2 -%endif - - movu m2, [r0 + 16] - movu m3, [r0 + r1 + 16] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m1 - pmaddwd m8, m1 - packssdw m2, m8 - - movu m3, [r5 + 16] - movu m5, [r5 + r1 + 16] - - punpcklbw m7, m3, m5 - punpckhbw m3, m5 - - movhlps m8, m7 - punpcklbw m7, m9 - punpcklbw m8, m9 - pmaddwd m7, m0 - pmaddwd m8, m0 - packssdw m7, m8 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m0 - pmaddwd m8, m0 - packssdw m3, m8 - - paddw m4, m7 - paddw m2, m3 - -%ifidn %1,pp - paddw m4, m6 - psraw m4, 6 - paddw m2, m6 - psraw m2, 6 - - packuswb m4, m2 - movu [r2 + 16], m4 -%elifidn %1,ps - psubw m4, m6 - psubw m2, m6 - movu [r2 + 32], m4 - movu [r2 + 48], m2 -%endif - - lea r0, [r0 + r1] - lea r2, [r2 + r3] - dec r4 - jnz .loop - RET - -%endmacro - -%if ARCH_X86_64 - FILTER_V4_W32_sse2 pp, 8 - FILTER_V4_W32_sse2 pp, 16 - FILTER_V4_W32_sse2 pp, 24 - FILTER_V4_W32_sse2 pp, 32 - 
- FILTER_V4_W32_sse2 pp, 48 - FILTER_V4_W32_sse2 pp, 64 - - FILTER_V4_W32_sse2 ps, 8 - FILTER_V4_W32_sse2 ps, 16 - FILTER_V4_W32_sse2 ps, 24 - FILTER_V4_W32_sse2 ps, 32 - - FILTER_V4_W32_sse2 ps, 48 - FILTER_V4_W32_sse2 ps, 64 -%endif - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_%1_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W16n_H2_sse2 3 -INIT_XMM sse2 -cglobal interp_4tap_vert_%1_%2x%3, 4, 7, 11 - mov r4d, r4m - sub r0, r1 - shl r4d, 5 - pxor m9, m9 - -%ifidn %1,pp - mova m7, [pw_32] -%elifidn %1,ps - mova m7, [pw_2000] - add r3d, r3d -%endif - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - mova m1, [r5 + r4] - mova m0, [r5 + r4 + 16] -%else - mova m1, [tab_ChromaCoeffV + r4] - mova m0, [tab_ChromaCoeffV + r4 + 16] -%endif - - mov r4d, %3/2 - -.loop: - - mov r6d, %2/16 - -.loopW: - - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - packssdw m4, m8 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m1 - pmaddwd m8, m1 - packssdw m2, m8 - - lea r5, [r0 + 2 * r1] - movu m5, [r5] - movu m6, [r5 + r1] - - punpckhbw m10, m5, m6 - movhlps m8, m10 - punpcklbw m10, m9 - punpcklbw m8, m9 - pmaddwd m10, m0 - pmaddwd m8, m0 - packssdw m10, m8 - paddw m2, m10 - - punpcklbw m10, m5, m6 - movhlps m8, m10 - punpcklbw m10, m9 - punpcklbw m8, m9 - pmaddwd m10, m0 - pmaddwd m8, m0 - packssdw m10, m8 - paddw m4, m10 - -%ifidn %1,pp - paddw m4, m7 - psraw m4, 6 - paddw m2, m7 - psraw m2, 6 - - packuswb m4, m2 - movu [r2], m4 -%elifidn %1,ps - psubw m4, m7 - psubw m2, m7 - movu [r2], m4 - movu [r2 + 16], m2 -%endif - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - movhlps m8, m4 - punpcklbw m4, m9 - punpcklbw m8, m9 - pmaddwd m4, m1 - pmaddwd m8, m1 - 
packssdw m4, m8 - - movhlps m8, m3 - punpcklbw m3, m9 - punpcklbw m8, m9 - pmaddwd m3, m1 - pmaddwd m8, m1 - packssdw m3, m8 - - movu m5, [r5 + 2 * r1] - - punpcklbw m2, m6, m5 - punpckhbw m6, m5 - - movhlps m8, m2 - punpcklbw m2, m9 - punpcklbw m8, m9 - pmaddwd m2, m0 - pmaddwd m8, m0 - packssdw m2, m8 - - movhlps m8, m6 - punpcklbw m6, m9 - punpcklbw m8, m9 - pmaddwd m6, m0 - pmaddwd m8, m0 - packssdw m6, m8 - - paddw m4, m2 - paddw m3, m6 - -%ifidn %1,pp - paddw m4, m7 - psraw m4, 6 - paddw m3, m7 - psraw m3, 6 - - packuswb m4, m3 - movu [r2 + r3], m4 - add r2, 16 -%elifidn %1,ps - psubw m4, m7 - psubw m3, m7 - movu [r2 + r3], m4 - movu [r2 + r3 + 16], m3 - add r2, 32 -%endif - - add r0, 16 - dec r6d - jnz .loopW - - lea r0, [r0 + r1 * 2 - %2] - -%ifidn %1,pp - lea r2, [r2 + r3 * 2 - %2] -%elifidn %1,ps - lea r2, [r2 + r3 * 2 - (%2 * 2)] -%endif - - dec r4d - jnz .loop - RET - -%endmacro - -%if ARCH_X86_64 - FILTER_V4_W16n_H2_sse2 pp, 64, 64 - FILTER_V4_W16n_H2_sse2 pp, 64, 32 - FILTER_V4_W16n_H2_sse2 pp, 64, 48 - FILTER_V4_W16n_H2_sse2 pp, 48, 64 - FILTER_V4_W16n_H2_sse2 pp, 64, 16 - FILTER_V4_W16n_H2_sse2 ps, 64, 64 - FILTER_V4_W16n_H2_sse2 ps, 64, 32 - FILTER_V4_W16n_H2_sse2 ps, 64, 48 - FILTER_V4_W16n_H2_sse2 ps, 48, 64 - FILTER_V4_W16n_H2_sse2 ps, 64, 16 -%endif - -;----------------------------------------------------------------------------- -;void interp_4tap_vert_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_2x4, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - lea r4, [r1 * 3] - lea r5, [r0 + 4 * r1] - pshufb m0, [tab_Cm] - mova m1, [pw_512] - - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r4] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklbw m2, 
m6 - - pmaddubsw m2, m0 - - movd m6, [r5] - - punpcklbw m3, m4 - punpcklbw m7, m5, m6 - punpcklbw m3, m7 - - pmaddubsw m3, m0 - - phaddw m2, m3 - - pmulhrsw m2, m1 - - movd m7, [r5 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m7 - punpcklbw m4, m3 - - pmaddubsw m4, m0 - - movd m3, [r5 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m7, m3 - punpcklbw m5, m7 - - pmaddubsw m5, m0 - - phaddw m4, m5 - - pmulhrsw m4, m1 - packuswb m2, m4 - - pextrw [r2], m2, 0 - pextrw [r2 + r3], m2, 2 - lea r2, [r2 + 2 * r3] - pextrw [r2], m2, 4 - pextrw [r2 + r3], m2, 6 - - RET - -%macro FILTER_VER_CHROMA_AVX2_2x4 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_2x4, 4, 6, 2 - mov r4d, r4m - shl r4d, 5 - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff_V] - add r5, r4 -%else - lea r5, [tab_ChromaCoeff_V + r4] -%endif - - lea r4, [r1 * 3] - - pinsrw xm1, [r0], 0 - pinsrw xm1, [r0 + r1], 1 - pinsrw xm1, [r0 + r1 * 2], 2 - pinsrw xm1, [r0 + r4], 3 - lea r0, [r0 + r1 * 4] - pinsrw xm1, [r0], 4 - pinsrw xm1, [r0 + r1], 5 - pinsrw xm1, [r0 + r1 * 2], 6 - - pshufb xm0, xm1, [interp_vert_shuf] - pshufb xm1, [interp_vert_shuf + 32] - vinserti128 m0, m0, xm1, 1 - pmaddubsw m0, [r5] - vextracti128 xm1, m0, 1 - paddw xm0, xm1 -%ifidn %1,pp - pmulhrsw xm0, [pw_512] - packuswb xm0, xm0 - lea r4, [r3 * 3] - pextrw [r2], xm0, 0 - pextrw [r2 + r3], xm0, 1 - pextrw [r2 + r3 * 2], xm0, 2 - pextrw [r2 + r4], xm0, 3 -%else - add r3d, r3d - lea r4, [r3 * 3] - psubw xm0, [pw_2000] - movd [r2], xm0 - pextrd [r2 + r3], xm0, 1 - pextrd [r2 + r3 * 2], xm0, 2 - pextrd [r2 + r4], xm0, 3 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_2x4 pp - FILTER_VER_CHROMA_AVX2_2x4 ps - -%macro FILTER_VER_CHROMA_AVX2_2x8 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_2x8, 4, 6, 2 - mov r4d, r4m - shl r4d, 6 - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - - pinsrw xm1, [r0], 0 - pinsrw xm1, [r0 + r1], 1 - pinsrw xm1, [r0 + r1 
* 2], 2 - pinsrw xm1, [r0 + r4], 3 - lea r0, [r0 + r1 * 4] - pinsrw xm1, [r0], 4 - pinsrw xm1, [r0 + r1], 5 - pinsrw xm1, [r0 + r1 * 2], 6 - pinsrw xm1, [r0 + r4], 7 - movhlps xm0, xm1 - lea r0, [r0 + r1 * 4] - pinsrw xm0, [r0], 4 - pinsrw xm0, [r0 + r1], 5 - pinsrw xm0, [r0 + r1 * 2], 6 - vinserti128 m1, m1, xm0, 1 - - pshufb m0, m1, [interp_vert_shuf] - pshufb m1, [interp_vert_shuf + 32] - pmaddubsw m0, [r5] - pmaddubsw m1, [r5 + 1 * mmsize] - paddw m0, m1 -%ifidn %1,pp - pmulhrsw m0, [pw_512] - vextracti128 xm1, m0, 1 - packuswb xm0, xm1 - lea r4, [r3 * 3] - pextrw [r2], xm0, 0 - pextrw [r2 + r3], xm0, 1 - pextrw [r2 + r3 * 2], xm0, 2 - pextrw [r2 + r4], xm0, 3 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm0, 4 - pextrw [r2 + r3], xm0, 5 - pextrw [r2 + r3 * 2], xm0, 6 - pextrw [r2 + r4], xm0, 7 -%else - add r3d, r3d - lea r4, [r3 * 3] - psubw m0, [pw_2000] - vextracti128 xm1, m0, 1 - movd [r2], xm0 - pextrd [r2 + r3], xm0, 1 - pextrd [r2 + r3 * 2], xm0, 2 - pextrd [r2 + r4], xm0, 3 - lea r2, [r2 + r3 * 4] - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - pextrd [r2 + r3 * 2], xm1, 2 - pextrd [r2 + r4], xm1, 3 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_2x8 pp - FILTER_VER_CHROMA_AVX2_2x8 ps - -%macro FILTER_VER_CHROMA_AVX2_2x16 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_2x16, 4, 6, 3 - mov r4d, r4m - shl r4d, 6 - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - - movd xm1, [r0] - pinsrw xm1, [r0 + r1], 1 - pinsrw xm1, [r0 + r1 * 2], 2 - pinsrw xm1, [r0 + r4], 3 - lea r0, [r0 + r1 * 4] - pinsrw xm1, [r0], 4 - pinsrw xm1, [r0 + r1], 5 - pinsrw xm1, [r0 + r1 * 2], 6 - pinsrw xm1, [r0 + r4], 7 - lea r0, [r0 + r1 * 4] - pinsrw xm0, [r0], 4 - pinsrw xm0, [r0 + r1], 5 - pinsrw xm0, [r0 + r1 * 2], 6 - pinsrw xm0, [r0 + r4], 7 - punpckhqdq xm0, xm1, xm0 - vinserti128 m1, m1, xm0, 1 - - pshufb m2, m1, [interp_vert_shuf] - pshufb m1, [interp_vert_shuf + 32] - pmaddubsw m2, 
[r5] - pmaddubsw m1, [r5 + 1 * mmsize] - paddw m2, m1 - - lea r0, [r0 + r1 * 4] - pinsrw xm1, [r0], 4 - pinsrw xm1, [r0 + r1], 5 - pinsrw xm1, [r0 + r1 * 2], 6 - pinsrw xm1, [r0 + r4], 7 - punpckhqdq xm1, xm0, xm1 - lea r0, [r0 + r1 * 4] - pinsrw xm0, [r0], 4 - pinsrw xm0, [r0 + r1], 5 - pinsrw xm0, [r0 + r1 * 2], 6 - punpckhqdq xm0, xm1, xm0 - vinserti128 m1, m1, xm0, 1 - - pshufb m0, m1, [interp_vert_shuf] - pshufb m1, [interp_vert_shuf + 32] - pmaddubsw m0, [r5] - pmaddubsw m1, [r5 + 1 * mmsize] - paddw m0, m1 -%ifidn %1,pp - mova m1, [pw_512] - pmulhrsw m2, m1 - pmulhrsw m0, m1 - packuswb m2, m0 - lea r4, [r3 * 3] - pextrw [r2], xm2, 0 - pextrw [r2 + r3], xm2, 1 - pextrw [r2 + r3 * 2], xm2, 2 - pextrw [r2 + r4], xm2, 3 - vextracti128 xm0, m2, 1 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm0, 0 - pextrw [r2 + r3], xm0, 1 - pextrw [r2 + r3 * 2], xm0, 2 - pextrw [r2 + r4], xm0, 3 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm2, 4 - pextrw [r2 + r3], xm2, 5 - pextrw [r2 + r3 * 2], xm2, 6 - pextrw [r2 + r4], xm2, 7 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm0, 4 - pextrw [r2 + r3], xm0, 5 - pextrw [r2 + r3 * 2], xm0, 6 - pextrw [r2 + r4], xm0, 7 -%else - add r3d, r3d - lea r4, [r3 * 3] - vbroadcasti128 m1, [pw_2000] - psubw m2, m1 - psubw m0, m1 - vextracti128 xm1, m2, 1 - movd [r2], xm2 - pextrd [r2 + r3], xm2, 1 - pextrd [r2 + r3 * 2], xm2, 2 - pextrd [r2 + r4], xm2, 3 - lea r2, [r2 + r3 * 4] - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - pextrd [r2 + r3 * 2], xm1, 2 - pextrd [r2 + r4], xm1, 3 - vextracti128 xm1, m0, 1 - lea r2, [r2 + r3 * 4] - movd [r2], xm0 - pextrd [r2 + r3], xm0, 1 - pextrd [r2 + r3 * 2], xm0, 2 - pextrd [r2 + r4], xm0, 3 - lea r2, [r2 + r3 * 4] - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 - pextrd [r2 + r3 * 2], xm1, 2 - pextrd [r2 + r4], xm1, 3 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_2x16 pp - FILTER_VER_CHROMA_AVX2_2x16 ps - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_2x8(pixel 
*src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W2_H4 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_2x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - - mova m1, [pw_512] - - mov r4d, %2 - lea r5, [3 * r1] - -.loop: - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r5] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklbw m2, m6 - - pmaddubsw m2, m0 - - lea r0, [r0 + 4 * r1] - movd m6, [r0] - - punpcklbw m3, m4 - punpcklbw m7, m5, m6 - punpcklbw m3, m7 - - pmaddubsw m3, m0 - - phaddw m2, m3 - - pmulhrsw m2, m1 - - movd m7, [r0 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m7 - punpcklbw m4, m3 - - pmaddubsw m4, m0 - - movd m3, [r0 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m7, m3 - punpcklbw m5, m7 - - pmaddubsw m5, m0 - - phaddw m4, m5 - - pmulhrsw m4, m1 - packuswb m2, m4 - - pextrw [r2], m2, 0 - pextrw [r2 + r3], m2, 2 - lea r2, [r2 + 2 * r3] - pextrw [r2], m2, 4 - pextrw [r2 + r3], m2, 6 - - lea r2, [r2 + 2 * r3] - - sub r4, 4 - jnz .loop - RET -%endmacro - - FILTER_V4_W2_H4 2, 8 - FILTER_V4_W2_H4 2, 16 - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_4x2, 4, 6, 6 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - lea r5, [r0 + 2 * r1] - - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r5] - movd m5, [r5 + r1] - - punpcklbw m2, m3 - punpcklbw m1, m4, m5 - punpcklbw m2, m1 - - pmaddubsw m2, 
m0 - - movd m1, [r0 + 4 * r1] - - punpcklbw m3, m4 - punpcklbw m5, m1 - punpcklbw m3, m5 - - pmaddubsw m3, m0 - - phaddw m2, m3 - - pmulhrsw m2, [pw_512] - packuswb m2, m2 - movd [r2], m2 - pextrd [r2 + r3], m2, 1 - - RET - -%macro FILTER_VER_CHROMA_AVX2_4x2 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x2, 4, 6, 4 - mov r4d, r4m - shl r4d, 5 - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff_V] - add r5, r4 -%else - lea r5, [tab_ChromaCoeff_V + r4] -%endif - - lea r4, [r1 * 3] - - movd xm1, [r0] - movd xm2, [r0 + r1] - punpcklbw xm1, xm2 - movd xm3, [r0 + r1 * 2] - punpcklbw xm2, xm3 - movlhps xm1, xm2 - movd xm0, [r0 + r4] - punpcklbw xm3, xm0 - movd xm2, [r0 + r1 * 4] - punpcklbw xm0, xm2 - movlhps xm3, xm0 - vinserti128 m1, m1, xm3, 1 ; m1 = row[x x x 4 3 2 1 0] - - pmaddubsw m1, [r5] - vextracti128 xm3, m1, 1 - paddw xm1, xm3 -%ifidn %1,pp - pmulhrsw xm1, [pw_512] - packuswb xm1, xm1 - movd [r2], xm1 - pextrd [r2 + r3], xm1, 1 -%else - add r3d, r3d - psubw xm1, [pw_2000] - movq [r2], xm1 - movhps [r2 + r3], xm1 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_4x2 pp - FILTER_VER_CHROMA_AVX2_4x2 ps - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_4x4, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - mova m1, [pw_512] - lea r5, [r0 + 4 * r1] - lea r4, [r1 * 3] - - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r4] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklbw m2, m6 - - pmaddubsw m2, m0 - - movd m6, [r5] - - punpcklbw m3, m4 - punpcklbw m7, m5, m6 - punpcklbw m3, m7 - - pmaddubsw m3, m0 - - phaddw m2, m3 - - pmulhrsw m2, 
m1 - - movd m7, [r5 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m7 - punpcklbw m4, m3 - - pmaddubsw m4, m0 - - movd m3, [r5 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m7, m3 - punpcklbw m5, m7 - - pmaddubsw m5, m0 - - phaddw m4, m5 - - pmulhrsw m4, m1 - - packuswb m2, m4 - movd [r2], m2 - pextrd [r2 + r3], m2, 1 - lea r2, [r2 + 2 * r3] - pextrd [r2], m2, 2 - pextrd [r2 + r3], m2, 3 - RET - -%macro FILTER_VER_CHROMA_AVX2_4x4 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x4, 4, 6, 3 - mov r4d, r4m - shl r4d, 6 - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - - movd xm1, [r0] - pinsrd xm1, [r0 + r1], 1 - pinsrd xm1, [r0 + r1 * 2], 2 - pinsrd xm1, [r0 + r4], 3 ; m1 = row[3 2 1 0] - lea r0, [r0 + r1 * 4] - movd xm2, [r0] - pinsrd xm2, [r0 + r1], 1 - pinsrd xm2, [r0 + r1 * 2], 2 ; m2 = row[x 6 5 4] - vinserti128 m1, m1, xm2, 1 ; m1 = row[x 6 5 4 3 2 1 0] - mova m2, [v4_interp4_vpp_shuf1] - vpermd m0, m2, m1 ; m0 = row[4 3 3 2 2 1 1 0] - mova m2, [v4_interp4_vpp_shuf1 + mmsize] - vpermd m1, m2, m1 ; m1 = row[6 5 5 4 4 3 3 2] - - mova m2, [v4_interp4_vpp_shuf] - pshufb m0, m0, m2 - pshufb m1, m1, m2 - pmaddubsw m0, [r5] - pmaddubsw m1, [r5 + mmsize] - paddw m0, m1 ; m0 = WORD ROW[3 2 1 0] -%ifidn %1,pp - pmulhrsw m0, [pw_512] - vextracti128 xm1, m0, 1 - packuswb xm0, xm1 - lea r5, [r3 * 3] - movd [r2], xm0 - pextrd [r2 + r3], xm0, 1 - pextrd [r2 + r3 * 2], xm0, 2 - pextrd [r2 + r5], xm0, 3 -%else - add r3d, r3d - psubw m0, [pw_2000] - vextracti128 xm1, m0, 1 - lea r5, [r3 * 3] - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm1 - movhps [r2 + r5], xm1 -%endif - RET -%endmacro - FILTER_VER_CHROMA_AVX2_4x4 pp - FILTER_VER_CHROMA_AVX2_4x4 ps - -%macro FILTER_VER_CHROMA_AVX2_4x8 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x8, 4, 6, 5 - mov r4d, r4m - shl r4d, 6 - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, 
[tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - - movd xm1, [r0] - pinsrd xm1, [r0 + r1], 1 - pinsrd xm1, [r0 + r1 * 2], 2 - pinsrd xm1, [r0 + r4], 3 ; m1 = row[3 2 1 0] - lea r0, [r0 + r1 * 4] - movd xm2, [r0] - pinsrd xm2, [r0 + r1], 1 - pinsrd xm2, [r0 + r1 * 2], 2 - pinsrd xm2, [r0 + r4], 3 ; m2 = row[7 6 5 4] - vinserti128 m1, m1, xm2, 1 ; m1 = row[7 6 5 4 3 2 1 0] - lea r0, [r0 + r1 * 4] - movd xm3, [r0] - pinsrd xm3, [r0 + r1], 1 - pinsrd xm3, [r0 + r1 * 2], 2 ; m3 = row[x 10 9 8] - vinserti128 m2, m2, xm3, 1 ; m2 = row[x 10 9 8 7 6 5 4] - mova m3, [v4_interp4_vpp_shuf1] - vpermd m0, m3, m1 ; m0 = row[4 3 3 2 2 1 1 0] - vpermd m4, m3, m2 ; m4 = row[8 7 7 6 6 5 5 4] - mova m3, [v4_interp4_vpp_shuf1 + mmsize] - vpermd m1, m3, m1 ; m1 = row[6 5 5 4 4 3 3 2] - vpermd m2, m3, m2 ; m2 = row[10 9 9 8 8 7 7 6] - - mova m3, [v4_interp4_vpp_shuf] - pshufb m0, m0, m3 - pshufb m1, m1, m3 - pshufb m2, m2, m3 - pshufb m4, m4, m3 - pmaddubsw m0, [r5] - pmaddubsw m4, [r5] - pmaddubsw m1, [r5 + mmsize] - pmaddubsw m2, [r5 + mmsize] - paddw m0, m1 ; m0 = WORD ROW[3 2 1 0] - paddw m4, m2 ; m4 = WORD ROW[7 6 5 4] -%ifidn %1,pp - pmulhrsw m0, [pw_512] - pmulhrsw m4, [pw_512] - packuswb m0, m4 - vextracti128 xm1, m0, 1 - lea r5, [r3 * 3] - movd [r2], xm0 - pextrd [r2 + r3], xm0, 1 - movd [r2 + r3 * 2], xm1 - pextrd [r2 + r5], xm1, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm0, 2 - pextrd [r2 + r3], xm0, 3 - pextrd [r2 + r3 * 2], xm1, 2 - pextrd [r2 + r5], xm1, 3 -%else - add r3d, r3d - psubw m0, [pw_2000] - psubw m4, [pw_2000] - vextracti128 xm1, m0, 1 - vextracti128 xm2, m4, 1 - lea r5, [r3 * 3] - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm1 - movhps [r2 + r5], xm1 - lea r2, [r2 + r3 * 4] - movq [r2], xm4 - movhps [r2 + r3], xm4 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r5], xm2 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_4x8 pp - FILTER_VER_CHROMA_AVX2_4x8 ps - -%macro FILTER_VER_CHROMA_AVX2_4xN 2 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 
-cglobal interp_4tap_vert_%1_4x%2, 4, 6, 12 - mov r4d, r4m - shl r4d, 6 - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - mova m10, [r5] - mova m11, [r5 + mmsize] -%ifidn %1,pp - mova m9, [pw_512] -%else - add r3d, r3d - mova m9, [pw_2000] -%endif - lea r5, [r3 * 3] -%rep %2 / 16 - movd xm1, [r0] - pinsrd xm1, [r0 + r1], 1 - pinsrd xm1, [r0 + r1 * 2], 2 - pinsrd xm1, [r0 + r4], 3 ; m1 = row[3 2 1 0] - lea r0, [r0 + r1 * 4] - movd xm2, [r0] - pinsrd xm2, [r0 + r1], 1 - pinsrd xm2, [r0 + r1 * 2], 2 - pinsrd xm2, [r0 + r4], 3 ; m2 = row[7 6 5 4] - vinserti128 m1, m1, xm2, 1 ; m1 = row[7 6 5 4 3 2 1 0] - lea r0, [r0 + r1 * 4] - movd xm3, [r0] - pinsrd xm3, [r0 + r1], 1 - pinsrd xm3, [r0 + r1 * 2], 2 - pinsrd xm3, [r0 + r4], 3 ; m3 = row[11 10 9 8] - vinserti128 m2, m2, xm3, 1 ; m2 = row[11 10 9 8 7 6 5 4] - lea r0, [r0 + r1 * 4] - movd xm4, [r0] - pinsrd xm4, [r0 + r1], 1 - pinsrd xm4, [r0 + r1 * 2], 2 - pinsrd xm4, [r0 + r4], 3 ; m4 = row[15 14 13 12] - vinserti128 m3, m3, xm4, 1 ; m3 = row[15 14 13 12 11 10 9 8] - lea r0, [r0 + r1 * 4] - movd xm5, [r0] - pinsrd xm5, [r0 + r1], 1 - pinsrd xm5, [r0 + r1 * 2], 2 ; m5 = row[x 18 17 16] - vinserti128 m4, m4, xm5, 1 ; m4 = row[x 18 17 16 15 14 13 12] - mova m5, [v4_interp4_vpp_shuf1] - vpermd m0, m5, m1 ; m0 = row[4 3 3 2 2 1 1 0] - vpermd m6, m5, m2 ; m6 = row[8 7 7 6 6 5 5 4] - vpermd m7, m5, m3 ; m7 = row[12 11 11 10 10 9 9 8] - vpermd m8, m5, m4 ; m8 = row[16 15 15 14 14 13 13 12] - mova m5, [v4_interp4_vpp_shuf1 + mmsize] - vpermd m1, m5, m1 ; m1 = row[6 5 5 4 4 3 3 2] - vpermd m2, m5, m2 ; m2 = row[10 9 9 8 8 7 7 6] - vpermd m3, m5, m3 ; m3 = row[14 13 13 12 12 11 11 10] - vpermd m4, m5, m4 ; m4 = row[18 17 17 16 16 15 15 14] - - mova m5, [v4_interp4_vpp_shuf] - pshufb m0, m0, m5 - pshufb m1, m1, m5 - pshufb m2, m2, m5 - pshufb m4, m4, m5 - pshufb m3, m3, m5 - pshufb m6, m6, m5 - pshufb m7, m7, m5 - pshufb m8, m8, m5 
- pmaddubsw m0, m10 - pmaddubsw m6, m10 - pmaddubsw m7, m10 - pmaddubsw m8, m10 - pmaddubsw m1, m11 - pmaddubsw m2, m11 - pmaddubsw m3, m11 - pmaddubsw m4, m11 - paddw m0, m1 ; m0 = WORD ROW[3 2 1 0] - paddw m6, m2 ; m6 = WORD ROW[7 6 5 4] - paddw m7, m3 ; m7 = WORD ROW[11 10 9 8] - paddw m8, m4 ; m8 = WORD ROW[15 14 13 12] -%ifidn %1,pp - pmulhrsw m0, m9 - pmulhrsw m6, m9 - pmulhrsw m7, m9 - pmulhrsw m8, m9 - packuswb m0, m6 - packuswb m7, m8 - vextracti128 xm1, m0, 1 - vextracti128 xm2, m7, 1 - movd [r2], xm0 - pextrd [r2 + r3], xm0, 1 - movd [r2 + r3 * 2], xm1 - pextrd [r2 + r5], xm1, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm0, 2 - pextrd [r2 + r3], xm0, 3 - pextrd [r2 + r3 * 2], xm1, 2 - pextrd [r2 + r5], xm1, 3 - lea r2, [r2 + r3 * 4] - movd [r2], xm7 - pextrd [r2 + r3], xm7, 1 - movd [r2 + r3 * 2], xm2 - pextrd [r2 + r5], xm2, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm7, 2 - pextrd [r2 + r3], xm7, 3 - pextrd [r2 + r3 * 2], xm2, 2 - pextrd [r2 + r5], xm2, 3 -%else - psubw m0, m9 - psubw m6, m9 - psubw m7, m9 - psubw m8, m9 - vextracti128 xm1, m0, 1 - vextracti128 xm2, m6, 1 - vextracti128 xm3, m7, 1 - vextracti128 xm4, m8, 1 - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm1 - movhps [r2 + r5], xm1 - lea r2, [r2 + r3 * 4] - movq [r2], xm6 - movhps [r2 + r3], xm6 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r5], xm2 - lea r2, [r2 + r3 * 4] - movq [r2], xm7 - movhps [r2 + r3], xm7 - movq [r2 + r3 * 2], xm3 - movhps [r2 + r5], xm3 - lea r2, [r2 + r3 * 4] - movq [r2], xm8 - movhps [r2 + r3], xm8 - movq [r2 + r3 * 2], xm4 - movhps [r2 + r5], xm4 -%endif - lea r2, [r2 + r3 * 4] -%endrep - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_4xN pp, 16 - FILTER_VER_CHROMA_AVX2_4xN ps, 16 - FILTER_VER_CHROMA_AVX2_4xN pp, 32 - FILTER_VER_CHROMA_AVX2_4xN ps, 32 - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
-;----------------------------------------------------------------------------- -%macro FILTER_V4_W4_H4 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - - mova m1, [pw_512] - - mov r4d, %2 - - lea r5, [3 * r1] - -.loop: - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r5] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklbw m2, m6 - - pmaddubsw m2, m0 - - lea r0, [r0 + 4 * r1] - movd m6, [r0] - - punpcklbw m3, m4 - punpcklbw m7, m5, m6 - punpcklbw m3, m7 - - pmaddubsw m3, m0 - - phaddw m2, m3 - - pmulhrsw m2, m1 - - movd m7, [r0 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m7 - punpcklbw m4, m3 - - pmaddubsw m4, m0 - - movd m3, [r0 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m7, m3 - punpcklbw m5, m7 - - pmaddubsw m5, m0 - - phaddw m4, m5 - - pmulhrsw m4, m1 - packuswb m2, m4 - movd [r2], m2 - pextrd [r2 + r3], m2, 1 - lea r2, [r2 + 2 * r3] - pextrd [r2], m2, 2 - pextrd [r2 + r3], m2, 3 - - lea r2, [r2 + 2 * r3] - - sub r4, 4 - jnz .loop - RET -%endmacro - - FILTER_V4_W4_H4 4, 8 - FILTER_V4_W4_H4 4, 16 - - FILTER_V4_W4_H4 4, 32 - -%macro FILTER_V4_W8_H2 0 - punpcklbw m1, m2 - punpcklbw m7, m3, m0 - - pmaddubsw m1, m6 - pmaddubsw m7, m5 - - paddw m1, m7 - - pmulhrsw m1, m4 - packuswb m1, m1 -%endmacro - -%macro FILTER_V4_W8_H3 0 - punpcklbw m2, m3 - punpcklbw m7, m0, m1 - - pmaddubsw m2, m6 - pmaddubsw m7, m5 - - paddw m2, m7 - - pmulhrsw m2, m4 - packuswb m2, m2 -%endmacro - -%macro FILTER_V4_W8_H4 0 - punpcklbw m3, m0 - punpcklbw m7, m1, m2 - - pmaddubsw m3, m6 - pmaddubsw m7, m5 - - paddw m3, m7 - - pmulhrsw m3, m4 - packuswb m3, m3 -%endmacro - -%macro FILTER_V4_W8_H5 0 - punpcklbw m0, m1 - punpcklbw m7, m2, m3 - - pmaddubsw m0, m6 - pmaddubsw m7, m5 - - paddw m0, m7 - - pmulhrsw m0, m4 - packuswb m0, m0 -%endmacro - -%macro FILTER_V4_W8_8x2 2 
- FILTER_V4_W8 %1, %2 - movq m0, [r0 + 4 * r1] - - FILTER_V4_W8_H2 - - movh [r2 + r3], m1 -%endmacro - -%macro FILTER_V4_W8_8x4 2 - FILTER_V4_W8_8x2 %1, %2 -;8x3 - lea r6, [r0 + 4 * r1] - movq m1, [r6 + r1] - - FILTER_V4_W8_H3 - - movh [r2 + 2 * r3], m2 - -;8x4 - movq m2, [r6 + 2 * r1] - - FILTER_V4_W8_H4 - - lea r5, [r2 + 2 * r3] - movh [r5 + r3], m3 -%endmacro - -%macro FILTER_V4_W8_8x6 2 - FILTER_V4_W8_8x4 %1, %2 -;8x5 - lea r6, [r6 + 2 * r1] - movq m3, [r6 + r1] - - FILTER_V4_W8_H5 - - movh [r2 + 4 * r3], m0 - -;8x6 - movq m0, [r0 + 8 * r1] - - FILTER_V4_W8_H2 - - lea r5, [r2 + 4 * r3] - movh [r5 + r3], m1 -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W8 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8 - - mov r4d, r4m - - sub r0, r1 - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - lea r5, [r0 + 2 * r1] - movq m3, [r5 + r1] - - punpcklbw m0, m1 - punpcklbw m4, m2, m3 - -%ifdef PIC - lea r6, [tab_ChromaCoeff] - movd m5, [r6 + r4 * 4] -%else - movd m5, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m6, m5, [tab_Vm] - pmaddubsw m0, m6 - - pshufb m5, [tab_Vm + 16] - pmaddubsw m4, m5 - - paddw m0, m4 - - mova m4, [pw_512] - - pmulhrsw m0, m4 - packuswb m0, m0 - movh [r2], m0 -%endmacro - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_8x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- - FILTER_V4_W8_8x2 8, 2 - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
-;----------------------------------------------------------------------------- - FILTER_V4_W8_8x4 8, 4 - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_8x6(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- - FILTER_V4_W8_8x6 8, 6 - - RET - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_4x2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_4x2, 4, 6, 6 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - - movd m2, [r0] - movd m3, [r0 + r1] - lea r5, [r0 + 2 * r1] - movd m4, [r5] - movd m5, [r5 + r1] - - punpcklbw m2, m3 - punpcklbw m1, m4, m5 - punpcklbw m2, m1 - - pmaddubsw m2, m0 - - movd m1, [r0 + 4 * r1] - - punpcklbw m3, m4 - punpcklbw m5, m1 - punpcklbw m3, m5 - - pmaddubsw m3, m0 - - phaddw m2, m3 - - psubw m2, [pw_2000] - movh [r2], m2 - movhps [r2 + r3], m2 - - RET - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_4x4, 4, 6, 7 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - - lea r4, [r1 * 3] - lea r5, [r0 
+ 4 * r1] - - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r4] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklbw m2, m6 - - pmaddubsw m2, m0 - - movd m6, [r5] - - punpcklbw m3, m4 - punpcklbw m1, m5, m6 - punpcklbw m3, m1 - - pmaddubsw m3, m0 - - phaddw m2, m3 - - mova m1, [pw_2000] - - psubw m2, m1 - movh [r2], m2 - movhps [r2 + r3], m2 - - movd m2, [r5 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m2 - punpcklbw m4, m3 - - pmaddubsw m4, m0 - - movd m3, [r5 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m2, m3 - punpcklbw m5, m2 - - pmaddubsw m5, m0 - - phaddw m4, m5 - - psubw m4, m1 - lea r2, [r2 + 2 * r3] - movh [r2], m4 - movhps [r2 + r3], m4 - - RET - -;--------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------- -%macro FILTER_V_PS_W4_H4 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - - mova m1, [pw_2000] - - mov r4d, %2/4 - lea r5, [3 * r1] - -.loop: - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r5] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklbw m2, m6 - - pmaddubsw m2, m0 - - lea r0, [r0 + 4 * r1] - movd m6, [r0] - - punpcklbw m3, m4 - punpcklbw m7, m5, m6 - punpcklbw m3, m7 - - pmaddubsw m3, m0 - - phaddw m2, m3 - - psubw m2, m1 - movh [r2], m2 - movhps [r2 + r3], m2 - - movd m2, [r0 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m2 - punpcklbw m4, m3 - - pmaddubsw m4, m0 - - movd m3, [r0 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m2, m3 - punpcklbw m5, m2 - - pmaddubsw m5, m0 - - phaddw m4, m5 - - psubw 
m4, m1 - lea r2, [r2 + 2 * r3] - movh [r2], m4 - movhps [r2 + r3], m4 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V_PS_W4_H4 4, 8 - FILTER_V_PS_W4_H4 4, 16 - - FILTER_V_PS_W4_H4 4, 32 - -;-------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- -%macro FILTER_V_PS_W8_H8_H16_H2 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 7 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m5, [r5 + r4 * 4] -%else - movd m5, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m6, m5, [tab_Vm] - pshufb m5, [tab_Vm + 16] - mova m4, [pw_2000] - - mov r4d, %2/2 - lea r5, [3 * r1] - -.loopH: - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - movq m3, [r0 + r5] - - punpcklbw m0, m1 - punpcklbw m1, m2 - punpcklbw m2, m3 - - pmaddubsw m0, m6 - pmaddubsw m2, m5 - - paddw m0, m2 - - psubw m0, m4 - movu [r2], m0 - - movq m0, [r0 + 4 * r1] - - punpcklbw m3, m0 - - pmaddubsw m1, m6 - pmaddubsw m3, m5 - - paddw m1, m3 - psubw m1, m4 - - movu [r2 + r3], m1 - - lea r0, [r0 + 2 * r1] - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_V_PS_W8_H8_H16_H2 8, 2 - FILTER_V_PS_W8_H8_H16_H2 8, 4 - FILTER_V_PS_W8_H8_H16_H2 8, 6 - - FILTER_V_PS_W8_H8_H16_H2 8, 12 - FILTER_V_PS_W8_H8_H16_H2 8, 64 - -;-------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- -%macro FILTER_V_PS_W8_H8_H16_H32 2 -INIT_XMM sse4 -cglobal 
interp_4tap_vert_ps_%1x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m5, [r5 + r4 * 4] -%else - movd m5, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m6, m5, [tab_Vm] - pshufb m5, [tab_Vm + 16] - mova m4, [pw_2000] - - mov r4d, %2/4 - lea r5, [3 * r1] - -.loop: - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - movq m3, [r0 + r5] - - punpcklbw m0, m1 - punpcklbw m1, m2 - punpcklbw m2, m3 - - pmaddubsw m0, m6 - pmaddubsw m7, m2, m5 - - paddw m0, m7 - - psubw m0, m4 - movu [r2], m0 - - lea r0, [r0 + 4 * r1] - movq m0, [r0] - - punpcklbw m3, m0 - - pmaddubsw m1, m6 - pmaddubsw m7, m3, m5 - - paddw m1, m7 - - psubw m1, m4 - movu [r2 + r3], m1 - - movq m1, [r0 + r1] - - punpcklbw m0, m1 - - pmaddubsw m2, m6 - pmaddubsw m0, m5 - - paddw m2, m0 - - psubw m2, m4 - lea r2, [r2 + 2 * r3] - movu [r2], m2 - - movq m2, [r0 + 2 * r1] - - punpcklbw m1, m2 - - pmaddubsw m3, m6 - pmaddubsw m1, m5 - - paddw m3, m1 - psubw m3, m4 - - movu [r2 + r3], m3 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V_PS_W8_H8_H16_H32 8, 8 - FILTER_V_PS_W8_H8_H16_H32 8, 16 - FILTER_V_PS_W8_H8_H16_H32 8, 32 - -;------------------------------------------------------------------------------------------------------------ -;void interp_4tap_vert_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------ -%macro FILTER_V_PS_W6 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_6x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m5, [r5 + r4 * 4] -%else - movd m5, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m6, m5, [tab_Vm] - pshufb m5, [tab_Vm + 16] - mova m4, [pw_2000] - lea r5, [3 * r1] - mov r4d, %2/4 - -.loop: - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - movq m3, [r0 + r5] - - punpcklbw m0, m1 
- punpcklbw m1, m2 - punpcklbw m2, m3 - - pmaddubsw m0, m6 - pmaddubsw m7, m2, m5 - - paddw m0, m7 - psubw m0, m4 - - movh [r2], m0 - pshufd m0, m0, 2 - movd [r2 + 8], m0 - - lea r0, [r0 + 4 * r1] - movq m0, [r0] - punpcklbw m3, m0 - - pmaddubsw m1, m6 - pmaddubsw m7, m3, m5 - - paddw m1, m7 - psubw m1, m4 - - movh [r2 + r3], m1 - pshufd m1, m1, 2 - movd [r2 + r3 + 8], m1 - - movq m1, [r0 + r1] - punpcklbw m0, m1 - - pmaddubsw m2, m6 - pmaddubsw m0, m5 - - paddw m2, m0 - psubw m2, m4 - - lea r2,[r2 + 2 * r3] - movh [r2], m2 - pshufd m2, m2, 2 - movd [r2 + 8], m2 - - movq m2,[r0 + 2 * r1] - punpcklbw m1, m2 - - pmaddubsw m3, m6 - pmaddubsw m1, m5 - - paddw m3, m1 - psubw m3, m4 - - movh [r2 + r3], m3 - pshufd m3, m3, 2 - movd [r2 + r3 + 8], m3 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V_PS_W6 6, 8 - FILTER_V_PS_W6 6, 16 - -;--------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_12x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------- -%macro FILTER_V_PS_W12 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_12x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - - mov r4d, %2/2 - -.loop: - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r0, [r0 + 2 * r1] - movu m5, [r0] - movu m7, [r0 + r1] - - punpcklbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m4, m6 - - punpckhbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m2, m6 - - mova m6, [pw_2000] - - psubw m4, m6 - psubw m2, m6 - - movu [r2], m4 - movh [r2 + 16], m2 - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - pmaddubsw 
m4, m1 - pmaddubsw m3, m1 - - movu m2, [r0 + 2 * r1] - - punpcklbw m5, m7, m2 - punpckhbw m7, m2 - - pmaddubsw m5, m0 - pmaddubsw m7, m0 - - paddw m4, m5 - paddw m3, m7 - - psubw m4, m6 - psubw m3, m6 - - movu [r2 + r3], m4 - movh [r2 + r3 + 16], m3 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V_PS_W12 12, 16 - FILTER_V_PS_W12 12, 32 - -;--------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_16x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------- -%macro FILTER_V_PS_W16 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - mov r4d, %2/2 - -.loop: - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r0, [r0 + 2 * r1] - movu m5, [r0] - movu m7, [r0 + r1] - - punpcklbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m4, m6 - - punpckhbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m2, m6 - - mova m6, [pw_2000] - - psubw m4, m6 - psubw m2, m6 - - movu [r2], m4 - movu [r2 + 16], m2 - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m4, m1 - pmaddubsw m3, m1 - - movu m5, [r0 + 2 * r1] - - punpcklbw m2, m7, m5 - punpckhbw m7, m5 - - pmaddubsw m2, m0 - pmaddubsw m7, m0 - - paddw m4, m2 - paddw m3, m7 - - psubw m4, m6 - psubw m3, m6 - - movu [r2 + r3], m4 - movu [r2 + r3 + 16], m3 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V_PS_W16 16, 4 - FILTER_V_PS_W16 16, 8 - FILTER_V_PS_W16 16, 12 - FILTER_V_PS_W16 16, 16 - FILTER_V_PS_W16 16, 32 - - FILTER_V_PS_W16 16, 24 - FILTER_V_PS_W16 16, 64 - 
-;-------------------------------------------------------------------------------------------------------------- -;void interp_4tap_vert_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- -%macro FILTER_V4_PS_W24 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_24x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - - mov r4d, %2/2 - -.loop: - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r5, [r0 + 2 * r1] - - movu m5, [r5] - movu m7, [r5 + r1] - - punpcklbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m4, m6 - - punpckhbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m2, m6 - - mova m6, [pw_2000] - - psubw m4, m6 - psubw m2, m6 - - movu [r2], m4 - movu [r2 + 16], m2 - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m4, m1 - pmaddubsw m3, m1 - - movu m2, [r5 + 2 * r1] - - punpcklbw m5, m7, m2 - punpckhbw m7, m2 - - pmaddubsw m5, m0 - pmaddubsw m7, m0 - - paddw m4, m5 - paddw m3, m7 - - psubw m4, m6 - psubw m3, m6 - - movu [r2 + r3], m4 - movu [r2 + r3 + 16], m3 - - movq m2, [r0 + 16] - movq m3, [r0 + r1 + 16] - movq m4, [r5 + 16] - movq m5, [r5 + r1 + 16] - - punpcklbw m2, m3 - punpcklbw m7, m4, m5 - - pmaddubsw m2, m1 - pmaddubsw m7, m0 - - paddw m2, m7 - psubw m2, m6 - - movu [r2 + 32], m2 - - movq m2, [r5 + 2 * r1 + 16] - - punpcklbw m3, m4 - punpcklbw m5, m2 - - pmaddubsw m3, m1 - pmaddubsw m5, m0 - - paddw m3, m5 - psubw m3, m6 - - movu [r2 + r3 + 32], m3 - - mov r0, r5 - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V4_PS_W24 24, 32 - - FILTER_V4_PS_W24 24, 64 - 
-;--------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_32x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------- -%macro FILTER_V_PS_W32 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - - mova m7, [pw_2000] - - mov r4d, %2 - -.loop: - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r5, [r0 + 2 * r1] - movu m3, [r5] - movu m5, [r5 + r1] - - punpcklbw m6, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m6, m0 - pmaddubsw m3, m0 - - paddw m4, m6 - paddw m2, m3 - - psubw m4, m7 - psubw m2, m7 - - movu [r2], m4 - movu [r2 + 16], m2 - - movu m2, [r0 + 16] - movu m3, [r0 + r1 + 16] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - movu m3, [r5 + 16] - movu m5, [r5 + r1 + 16] - - punpcklbw m6, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m6, m0 - pmaddubsw m3, m0 - - paddw m4, m6 - paddw m2, m3 - - psubw m4, m7 - psubw m2, m7 - - movu [r2 + 32], m4 - movu [r2 + 48], m2 - - lea r0, [r0 + r1] - lea r2, [r2 + r3] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V_PS_W32 32, 8 - FILTER_V_PS_W32 32, 16 - FILTER_V_PS_W32 32, 24 - FILTER_V_PS_W32 32, 32 - - FILTER_V_PS_W32 32, 48 - FILTER_V_PS_W32 32, 64 - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W8_H8_H16_H32 2 -INIT_XMM sse4 
-cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m5, [r5 + r4 * 4] -%else - movd m5, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m6, m5, [tab_Vm] - pshufb m5, [tab_Vm + 16] - mova m4, [pw_512] - lea r5, [r1 * 3] - - mov r4d, %2 - -.loop: - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - movq m3, [r0 + r5] - - punpcklbw m0, m1 - punpcklbw m1, m2 - punpcklbw m2, m3 - - pmaddubsw m0, m6 - pmaddubsw m7, m2, m5 - - paddw m0, m7 - - pmulhrsw m0, m4 - packuswb m0, m0 - movh [r2], m0 - - lea r0, [r0 + 4 * r1] - movq m0, [r0] - - punpcklbw m3, m0 - - pmaddubsw m1, m6 - pmaddubsw m7, m3, m5 - - paddw m1, m7 - - pmulhrsw m1, m4 - packuswb m1, m1 - movh [r2 + r3], m1 - - movq m1, [r0 + r1] - - punpcklbw m0, m1 - - pmaddubsw m2, m6 - pmaddubsw m0, m5 - - paddw m2, m0 - - pmulhrsw m2, m4 - - movq m7, [r0 + 2 * r1] - punpcklbw m1, m7 - - pmaddubsw m3, m6 - pmaddubsw m1, m5 - - paddw m3, m1 - - pmulhrsw m3, m4 - packuswb m2, m3 - - lea r2, [r2 + 2 * r3] - movh [r2], m2 - movhps [r2 + r3], m2 - - lea r2, [r2 + 2 * r3] - - sub r4, 4 - jnz .loop - RET -%endmacro - - FILTER_V4_W8_H8_H16_H32 8, 8 - FILTER_V4_W8_H8_H16_H32 8, 16 - FILTER_V4_W8_H8_H16_H32 8, 32 - - FILTER_V4_W8_H8_H16_H32 8, 12 - FILTER_V4_W8_H8_H16_H32 8, 64 - -%macro PROCESS_CHROMA_AVX2_W8_8R 0 - movq xm1, [r0] ; m1 = row 0 - movq xm2, [r0 + r1] ; m2 = row 1 - punpcklbw xm1, xm2 ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] - movq xm3, [r0 + r1 * 2] ; m3 = row 2 - punpcklbw xm2, xm3 ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - vinserti128 m5, m1, xm2, 1 ; m5 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] - pmaddubsw m5, [r5] - movq xm4, [r0 + r4] ; m4 = row 3 - punpcklbw xm3, xm4 ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] - lea r0, [r0 + r1 * 4] - movq xm1, [r0] ; m1 = row 4 - punpcklbw xm4, xm1 ; m4 = [47 37 46 36 45 35 44 34 43 33 42 
32 41 31 40 30] - vinserti128 m2, m3, xm4, 1 ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] - pmaddubsw m0, m2, [r5 + 1 * mmsize] - paddw m5, m0 - pmaddubsw m2, [r5] - movq xm3, [r0 + r1] ; m3 = row 5 - punpcklbw xm1, xm3 ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] - movq xm4, [r0 + r1 * 2] ; m4 = row 6 - punpcklbw xm3, xm4 ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - vinserti128 m1, m1, xm3, 1 ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] - pmaddubsw m0, m1, [r5 + 1 * mmsize] - paddw m2, m0 - pmaddubsw m1, [r5] - movq xm3, [r0 + r4] ; m3 = row 7 - punpcklbw xm4, xm3 ; m4 = [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60] - lea r0, [r0 + r1 * 4] - movq xm0, [r0] ; m0 = row 8 - punpcklbw xm3, xm0 ; m3 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - vinserti128 m4, m4, xm3, 1 ; m4 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60] - pmaddubsw m3, m4, [r5 + 1 * mmsize] - paddw m1, m3 - pmaddubsw m4, [r5] - movq xm3, [r0 + r1] ; m3 = row 9 - punpcklbw xm0, xm3 ; m0 = [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80] - movq xm6, [r0 + r1 * 2] ; m6 = row 10 - punpcklbw xm3, xm6 ; m3 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90] - vinserti128 m0, m0, xm3, 1 ; m0 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90] - [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80] - pmaddubsw m0, [r5 + 1 * mmsize] - paddw m4, m0 -%endmacro - -%macro FILTER_VER_CHROMA_AVX2_8x8 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x8, 4, 6, 7 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - PROCESS_CHROMA_AVX2_W8_8R -%ifidn %1,pp - lea r4, [r3 * 3] - mova m3, [pw_512] - pmulhrsw m5, m3 ; m5 = word: row 0, row 1 - pmulhrsw m2, m3 ; m2 = word: row 2, row 3 - 
pmulhrsw m1, m3 ; m1 = word: row 4, row 5 - pmulhrsw m4, m3 ; m4 = word: row 6, row 7 - packuswb m5, m2 - packuswb m1, m4 - vextracti128 xm2, m5, 1 - vextracti128 xm4, m1, 1 - movq [r2], xm5 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm5 - movhps [r2 + r4], xm2 - lea r2, [r2 + r3 * 4] - movq [r2], xm1 - movq [r2 + r3], xm4 - movhps [r2 + r3 * 2], xm1 - movhps [r2 + r4], xm4 -%else - add r3d, r3d - vbroadcasti128 m3, [pw_2000] - lea r4, [r3 * 3] - psubw m5, m3 ; m5 = word: row 0, row 1 - psubw m2, m3 ; m2 = word: row 2, row 3 - psubw m1, m3 ; m1 = word: row 4, row 5 - psubw m4, m3 ; m4 = word: row 6, row 7 - vextracti128 xm6, m5, 1 - vextracti128 xm3, m2, 1 - vextracti128 xm0, m1, 1 - movu [r2], xm5 - movu [r2 + r3], xm6 - movu [r2 + r3 * 2], xm2 - movu [r2 + r4], xm3 - lea r2, [r2 + r3 * 4] - movu [r2], xm1 - movu [r2 + r3], xm0 - movu [r2 + r3 * 2], xm4 - vextracti128 xm4, m4, 1 - movu [r2 + r4], xm4 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_8x8 pp - FILTER_VER_CHROMA_AVX2_8x8 ps - -%macro FILTER_VER_CHROMA_AVX2_8x6 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x6, 4, 6, 6 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - - movq xm1, [r0] ; m1 = row 0 - movq xm2, [r0 + r1] ; m2 = row 1 - punpcklbw xm1, xm2 ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] - movq xm3, [r0 + r1 * 2] ; m3 = row 2 - punpcklbw xm2, xm3 ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - vinserti128 m5, m1, xm2, 1 ; m5 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] - pmaddubsw m5, [r5] - movq xm4, [r0 + r4] ; m4 = row 3 - punpcklbw xm3, xm4 ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] - lea r0, [r0 + r1 * 4] - movq xm1, [r0] ; m1 = row 4 - punpcklbw xm4, xm1 ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - vinserti128 m2, m3, xm4, 1 ; m2 = [47 37 46 36 45 
35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] - pmaddubsw m0, m2, [r5 + 1 * mmsize] - paddw m5, m0 - pmaddubsw m2, [r5] - movq xm3, [r0 + r1] ; m3 = row 5 - punpcklbw xm1, xm3 ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] - movq xm4, [r0 + r1 * 2] ; m4 = row 6 - punpcklbw xm3, xm4 ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - vinserti128 m1, m1, xm3, 1 ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] - pmaddubsw m0, m1, [r5 + 1 * mmsize] - paddw m2, m0 - pmaddubsw m1, [r5] - movq xm3, [r0 + r4] ; m3 = row 7 - punpcklbw xm4, xm3 ; m4 = [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60] - lea r0, [r0 + r1 * 4] - movq xm0, [r0] ; m0 = row 8 - punpcklbw xm3, xm0 ; m3 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - vinserti128 m4, m4, xm3, 1 ; m4 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60] - pmaddubsw m4, [r5 + 1 * mmsize] - paddw m1, m4 -%ifidn %1,pp - lea r4, [r3 * 3] - mova m3, [pw_512] - pmulhrsw m5, m3 ; m5 = word: row 0, row 1 - pmulhrsw m2, m3 ; m2 = word: row 2, row 3 - pmulhrsw m1, m3 ; m1 = word: row 4, row 5 - packuswb m5, m2 - packuswb m1, m1 - vextracti128 xm2, m5, 1 - vextracti128 xm4, m1, 1 - movq [r2], xm5 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm5 - movhps [r2 + r4], xm2 - lea r2, [r2 + r3 * 4] - movq [r2], xm1 - movq [r2 + r3], xm4 -%else - add r3d, r3d - mova m3, [pw_2000] - lea r4, [r3 * 3] - psubw m5, m3 ; m5 = word: row 0, row 1 - psubw m2, m3 ; m2 = word: row 2, row 3 - psubw m1, m3 ; m1 = word: row 4, row 5 - vextracti128 xm4, m5, 1 - vextracti128 xm3, m2, 1 - vextracti128 xm0, m1, 1 - movu [r2], xm5 - movu [r2 + r3], xm4 - movu [r2 + r3 * 2], xm2 - movu [r2 + r4], xm3 - lea r2, [r2 + r3 * 4] - movu [r2], xm1 - movu [r2 + r3], xm0 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_8x6 pp - FILTER_VER_CHROMA_AVX2_8x6 ps - -%macro 
PROCESS_CHROMA_AVX2_W8_16R 1 - movq xm1, [r0] ; m1 = row 0 - movq xm2, [r0 + r1] ; m2 = row 1 - punpcklbw xm1, xm2 - movq xm3, [r0 + r1 * 2] ; m3 = row 2 - punpcklbw xm2, xm3 - vinserti128 m5, m1, xm2, 1 - pmaddubsw m5, [r5] - movq xm4, [r0 + r4] ; m4 = row 3 - punpcklbw xm3, xm4 - lea r0, [r0 + r1 * 4] - movq xm1, [r0] ; m1 = row 4 - punpcklbw xm4, xm1 - vinserti128 m2, m3, xm4, 1 - pmaddubsw m0, m2, [r5 + 1 * mmsize] - paddw m5, m0 - pmaddubsw m2, [r5] - movq xm3, [r0 + r1] ; m3 = row 5 - punpcklbw xm1, xm3 - movq xm4, [r0 + r1 * 2] ; m4 = row 6 - punpcklbw xm3, xm4 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m0, m1, [r5 + 1 * mmsize] - paddw m2, m0 - pmaddubsw m1, [r5] - movq xm3, [r0 + r4] ; m3 = row 7 - punpcklbw xm4, xm3 - lea r0, [r0 + r1 * 4] - movq xm0, [r0] ; m0 = row 8 - punpcklbw xm3, xm0 - vinserti128 m4, m4, xm3, 1 - pmaddubsw m3, m4, [r5 + 1 * mmsize] - paddw m1, m3 - pmaddubsw m4, [r5] - movq xm3, [r0 + r1] ; m3 = row 9 - punpcklbw xm0, xm3 - movq xm6, [r0 + r1 * 2] ; m6 = row 10 - punpcklbw xm3, xm6 - vinserti128 m0, m0, xm3, 1 - pmaddubsw m3, m0, [r5 + 1 * mmsize] - paddw m4, m3 - pmaddubsw m0, [r5] -%ifidn %1,pp - pmulhrsw m5, m7 ; m5 = word: row 0, row 1 - pmulhrsw m2, m7 ; m2 = word: row 2, row 3 - pmulhrsw m1, m7 ; m1 = word: row 4, row 5 - pmulhrsw m4, m7 ; m4 = word: row 6, row 7 - packuswb m5, m2 - packuswb m1, m4 - vextracti128 xm2, m5, 1 - vextracti128 xm4, m1, 1 - movq [r2], xm5 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm5 - movhps [r2 + r6], xm2 - lea r2, [r2 + r3 * 4] - movq [r2], xm1 - movq [r2 + r3], xm4 - movhps [r2 + r3 * 2], xm1 - movhps [r2 + r6], xm4 -%else - psubw m5, m7 ; m5 = word: row 0, row 1 - psubw m2, m7 ; m2 = word: row 2, row 3 - psubw m1, m7 ; m1 = word: row 4, row 5 - psubw m4, m7 ; m4 = word: row 6, row 7 - vextracti128 xm3, m5, 1 - movu [r2], xm5 - movu [r2 + r3], xm3 - vextracti128 xm3, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm3 - lea r2, [r2 + r3 * 4] - vextracti128 xm5, m1, 1 - vextracti128 xm3, m4, 
1 - movu [r2], xm1 - movu [r2 + r3], xm5 - movu [r2 + r3 * 2], xm4 - movu [r2 + r6], xm3 -%endif - movq xm3, [r0 + r4] ; m3 = row 11 - punpcklbw xm6, xm3 - lea r0, [r0 + r1 * 4] - movq xm5, [r0] ; m5 = row 12 - punpcklbw xm3, xm5 - vinserti128 m6, m6, xm3, 1 - pmaddubsw m3, m6, [r5 + 1 * mmsize] - paddw m0, m3 - pmaddubsw m6, [r5] - movq xm3, [r0 + r1] ; m3 = row 13 - punpcklbw xm5, xm3 - movq xm2, [r0 + r1 * 2] ; m2 = row 14 - punpcklbw xm3, xm2 - vinserti128 m5, m5, xm3, 1 - pmaddubsw m3, m5, [r5 + 1 * mmsize] - paddw m6, m3 - pmaddubsw m5, [r5] - movq xm3, [r0 + r4] ; m3 = row 15 - punpcklbw xm2, xm3 - lea r0, [r0 + r1 * 4] - movq xm1, [r0] ; m1 = row 16 - punpcklbw xm3, xm1 - vinserti128 m2, m2, xm3, 1 - pmaddubsw m3, m2, [r5 + 1 * mmsize] - paddw m5, m3 - pmaddubsw m2, [r5] - movq xm3, [r0 + r1] ; m3 = row 17 - punpcklbw xm1, xm3 - movq xm4, [r0 + r1 * 2] ; m4 = row 18 - punpcklbw xm3, xm4 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m1, [r5 + 1 * mmsize] - paddw m2, m1 - lea r2, [r2 + r3 * 4] -%ifidn %1,pp - pmulhrsw m0, m7 ; m0 = word: row 8, row 9 - pmulhrsw m6, m7 ; m6 = word: row 10, row 11 - pmulhrsw m5, m7 ; m5 = word: row 12, row 13 - pmulhrsw m2, m7 ; m2 = word: row 14, row 15 - packuswb m0, m6 - packuswb m5, m2 - vextracti128 xm6, m0, 1 - vextracti128 xm2, m5, 1 - movq [r2], xm0 - movq [r2 + r3], xm6 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r6], xm6 - lea r2, [r2 + r3 * 4] - movq [r2], xm5 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm5 - movhps [r2 + r6], xm2 -%else - psubw m0, m7 ; m0 = word: row 8, row 9 - psubw m6, m7 ; m6 = word: row 10, row 11 - psubw m5, m7 ; m5 = word: row 12, row 13 - psubw m2, m7 ; m2 = word: row 14, row 15 - vextracti128 xm1, m0, 1 - vextracti128 xm3, m6, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm6 - movu [r2 + r6], xm3 - lea r2, [r2 + r3 * 4] - vextracti128 xm1, m5, 1 - vextracti128 xm3, m2, 1 - movu [r2], xm5 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm3 -%endif -%endmacro - 
-%macro FILTER_VER_CHROMA_AVX2_8x16 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x16, 4, 7, 8 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m7, [pw_512] -%else - add r3d, r3d - mova m7, [pw_2000] -%endif - lea r6, [r3 * 3] - PROCESS_CHROMA_AVX2_W8_16R %1 - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_8x16 pp - FILTER_VER_CHROMA_AVX2_8x16 ps - -%macro FILTER_VER_CHROMA_AVX2_8x12 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x12, 4, 7, 8 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1, pp - mova m7, [pw_512] -%else - add r3d, r3d - mova m7, [pw_2000] -%endif - lea r6, [r3 * 3] - movq xm1, [r0] ; m1 = row 0 - movq xm2, [r0 + r1] ; m2 = row 1 - punpcklbw xm1, xm2 - movq xm3, [r0 + r1 * 2] ; m3 = row 2 - punpcklbw xm2, xm3 - vinserti128 m5, m1, xm2, 1 - pmaddubsw m5, [r5] - movq xm4, [r0 + r4] ; m4 = row 3 - punpcklbw xm3, xm4 - lea r0, [r0 + r1 * 4] - movq xm1, [r0] ; m1 = row 4 - punpcklbw xm4, xm1 - vinserti128 m2, m3, xm4, 1 - pmaddubsw m0, m2, [r5 + 1 * mmsize] - paddw m5, m0 - pmaddubsw m2, [r5] - movq xm3, [r0 + r1] ; m3 = row 5 - punpcklbw xm1, xm3 - movq xm4, [r0 + r1 * 2] ; m4 = row 6 - punpcklbw xm3, xm4 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m0, m1, [r5 + 1 * mmsize] - paddw m2, m0 - pmaddubsw m1, [r5] - movq xm3, [r0 + r4] ; m3 = row 7 - punpcklbw xm4, xm3 - lea r0, [r0 + r1 * 4] - movq xm0, [r0] ; m0 = row 8 - punpcklbw xm3, xm0 - vinserti128 m4, m4, xm3, 1 - pmaddubsw m3, m4, [r5 + 1 * mmsize] - paddw m1, m3 - pmaddubsw m4, [r5] - movq xm3, [r0 + r1] ; m3 = row 9 - punpcklbw xm0, xm3 - movq xm6, [r0 + r1 * 2] ; m6 = row 10 - punpcklbw xm3, xm6 - vinserti128 m0, m0, xm3, 1 - pmaddubsw m3, m0, [r5 + 1 * mmsize] - paddw m4, m3 - pmaddubsw m0, [r5] -%ifidn %1, pp - 
pmulhrsw m5, m7 ; m5 = word: row 0, row 1 - pmulhrsw m2, m7 ; m2 = word: row 2, row 3 - pmulhrsw m1, m7 ; m1 = word: row 4, row 5 - pmulhrsw m4, m7 ; m4 = word: row 6, row 7 - packuswb m5, m2 - packuswb m1, m4 - vextracti128 xm2, m5, 1 - vextracti128 xm4, m1, 1 - movq [r2], xm5 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm5 - movhps [r2 + r6], xm2 - lea r2, [r2 + r3 * 4] - movq [r2], xm1 - movq [r2 + r3], xm4 - movhps [r2 + r3 * 2], xm1 - movhps [r2 + r6], xm4 -%else - psubw m5, m7 ; m5 = word: row 0, row 1 - psubw m2, m7 ; m2 = word: row 2, row 3 - psubw m1, m7 ; m1 = word: row 4, row 5 - psubw m4, m7 ; m4 = word: row 6, row 7 - vextracti128 xm3, m5, 1 - movu [r2], xm5 - movu [r2 + r3], xm3 - vextracti128 xm3, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm3 - lea r2, [r2 + r3 * 4] - vextracti128 xm5, m1, 1 - vextracti128 xm3, m4, 1 - movu [r2], xm1 - movu [r2 + r3], xm5 - movu [r2 + r3 * 2], xm4 - movu [r2 + r6], xm3 -%endif - movq xm3, [r0 + r4] ; m3 = row 11 - punpcklbw xm6, xm3 - lea r0, [r0 + r1 * 4] - movq xm5, [r0] ; m5 = row 12 - punpcklbw xm3, xm5 - vinserti128 m6, m6, xm3, 1 - pmaddubsw m3, m6, [r5 + 1 * mmsize] - paddw m0, m3 - pmaddubsw m6, [r5] - movq xm3, [r0 + r1] ; m3 = row 13 - punpcklbw xm5, xm3 - movq xm2, [r0 + r1 * 2] ; m2 = row 14 - punpcklbw xm3, xm2 - vinserti128 m5, m5, xm3, 1 - pmaddubsw m3, m5, [r5 + 1 * mmsize] - paddw m6, m3 - lea r2, [r2 + r3 * 4] -%ifidn %1, pp - pmulhrsw m0, m7 ; m0 = word: row 8, row 9 - pmulhrsw m6, m7 ; m6 = word: row 10, row 11 - packuswb m0, m6 - vextracti128 xm6, m0, 1 - movq [r2], xm0 - movq [r2 + r3], xm6 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r6], xm6 -%else - psubw m0, m7 ; m0 = word: row 8, row 9 - psubw m6, m7 ; m6 = word: row 10, row 11 - vextracti128 xm1, m0, 1 - vextracti128 xm3, m6, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm6 - movu [r2 + r6], xm3 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_8x12 pp - FILTER_VER_CHROMA_AVX2_8x12 ps - -%macro 
FILTER_VER_CHROMA_AVX2_8xN 2 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x%2, 4, 7, 8 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m7, [pw_512] -%else - add r3d, r3d - mova m7, [pw_2000] -%endif - lea r6, [r3 * 3] -%rep %2 / 16 - PROCESS_CHROMA_AVX2_W8_16R %1 - lea r2, [r2 + r3 * 4] -%endrep - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_8xN pp, 32 - FILTER_VER_CHROMA_AVX2_8xN ps, 32 - FILTER_VER_CHROMA_AVX2_8xN pp, 64 - FILTER_VER_CHROMA_AVX2_8xN ps, 64 - -%macro PROCESS_CHROMA_AVX2_W8_4R 0 - movq xm1, [r0] ; m1 = row 0 - movq xm2, [r0 + r1] ; m2 = row 1 - punpcklbw xm1, xm2 ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] - movq xm3, [r0 + r1 * 2] ; m3 = row 2 - punpcklbw xm2, xm3 ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - vinserti128 m0, m1, xm2, 1 ; m0 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] - pmaddubsw m0, [r5] - movq xm4, [r0 + r4] ; m4 = row 3 - punpcklbw xm3, xm4 ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] - lea r0, [r0 + r1 * 4] - movq xm1, [r0] ; m1 = row 4 - punpcklbw xm4, xm1 ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - vinserti128 m2, m3, xm4, 1 ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] - pmaddubsw m4, m2, [r5 + 1 * mmsize] - paddw m0, m4 - pmaddubsw m2, [r5] - movq xm3, [r0 + r1] ; m3 = row 5 - punpcklbw xm1, xm3 ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] - movq xm4, [r0 + r1 * 2] ; m4 = row 6 - punpcklbw xm3, xm4 ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - vinserti128 m1, m1, xm3, 1 ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40] - pmaddubsw m1, [r5 + 1 * mmsize] - paddw m2, m1 -%endmacro - -%macro FILTER_VER_CHROMA_AVX2_8x4 1 
-INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x4, 4, 6, 5 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - PROCESS_CHROMA_AVX2_W8_4R -%ifidn %1,pp - lea r4, [r3 * 3] - mova m3, [pw_512] - pmulhrsw m0, m3 ; m0 = word: row 0, row 1 - pmulhrsw m2, m3 ; m2 = word: row 2, row 3 - packuswb m0, m2 - vextracti128 xm2, m0, 1 - movq [r2], xm0 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r4], xm2 -%else - add r3d, r3d - vbroadcasti128 m3, [pw_2000] - lea r4, [r3 * 3] - psubw m0, m3 ; m0 = word: row 0, row 1 - psubw m2, m3 ; m2 = word: row 2, row 3 - vextracti128 xm1, m0, 1 - vextracti128 xm4, m2, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r4], xm4 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_8x4 pp - FILTER_VER_CHROMA_AVX2_8x4 ps - -%macro FILTER_VER_CHROMA_AVX2_8x2 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x2, 4, 6, 4 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - - movq xm1, [r0] ; m1 = row 0 - movq xm2, [r0 + r1] ; m2 = row 1 - punpcklbw xm1, xm2 ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] - movq xm3, [r0 + r1 * 2] ; m3 = row 2 - punpcklbw xm2, xm3 ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - vinserti128 m1, m1, xm2, 1 ; m1 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00] - pmaddubsw m1, [r5] - movq xm2, [r0 + r4] ; m2 = row 3 - punpcklbw xm3, xm2 ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] - movq xm0, [r0 + r1 * 4] ; m0 = row 4 - punpcklbw xm2, xm0 ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - vinserti128 m3, m3, xm2, 1 ; m3 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20] - pmaddubsw m3, 
[r5 + 1 * mmsize] - paddw m1, m3 -%ifidn %1,pp - pmulhrsw m1, [pw_512] ; m1 = word: row 0, row 1 - packuswb m1, m1 - vextracti128 xm0, m1, 1 - movq [r2], xm1 - movq [r2 + r3], xm0 -%else - add r3d, r3d - psubw m1, [pw_2000] ; m1 = word: row 0, row 1 - vextracti128 xm0, m1, 1 - movu [r2], xm1 - movu [r2 + r3], xm0 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_8x2 pp - FILTER_VER_CHROMA_AVX2_8x2 ps - -%macro FILTER_VER_CHROMA_AVX2_6x8 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_6x8, 4, 6, 7 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 - PROCESS_CHROMA_AVX2_W8_8R -%ifidn %1,pp - lea r4, [r3 * 3] - mova m3, [pw_512] - pmulhrsw m5, m3 ; m5 = word: row 0, row 1 - pmulhrsw m2, m3 ; m2 = word: row 2, row 3 - pmulhrsw m1, m3 ; m1 = word: row 4, row 5 - pmulhrsw m4, m3 ; m4 = word: row 6, row 7 - packuswb m5, m2 - packuswb m1, m4 - vextracti128 xm2, m5, 1 - vextracti128 xm4, m1, 1 - movd [r2], xm5 - pextrw [r2 + 4], xm5, 2 - movd [r2 + r3], xm2 - pextrw [r2 + r3 + 4], xm2, 2 - pextrd [r2 + r3 * 2], xm5, 2 - pextrw [r2 + r3 * 2 + 4], xm5, 6 - pextrd [r2 + r4], xm2, 2 - pextrw [r2 + r4 + 4], xm2, 6 - lea r2, [r2 + r3 * 4] - movd [r2], xm1 - pextrw [r2 + 4], xm1, 2 - movd [r2 + r3], xm4 - pextrw [r2 + r3 + 4], xm4, 2 - pextrd [r2 + r3 * 2], xm1, 2 - pextrw [r2 + r3 * 2 + 4], xm1, 6 - pextrd [r2 + r4], xm4, 2 - pextrw [r2 + r4 + 4], xm4, 6 -%else - add r3d, r3d - vbroadcasti128 m3, [pw_2000] - lea r4, [r3 * 3] - psubw m5, m3 ; m5 = word: row 0, row 1 - psubw m2, m3 ; m2 = word: row 2, row 3 - psubw m1, m3 ; m1 = word: row 4, row 5 - psubw m4, m3 ; m4 = word: row 6, row 7 - vextracti128 xm6, m5, 1 - vextracti128 xm3, m2, 1 - vextracti128 xm0, m1, 1 - movq [r2], xm5 - pextrd [r2 + 8], xm5, 2 - movq [r2 + r3], xm6 - pextrd [r2 + r3 + 8], xm6, 2 - movq [r2 + r3 * 2], xm2 - pextrd [r2 + r3 * 2 + 8], xm2, 2 - movq [r2 + r4], xm3 - pextrd [r2 + r4 
+ 8], xm3, 2 - lea r2, [r2 + r3 * 4] - movq [r2], xm1 - pextrd [r2 + 8], xm1, 2 - movq [r2 + r3], xm0 - pextrd [r2 + r3 + 8], xm0, 2 - movq [r2 + r3 * 2], xm4 - pextrd [r2 + r3 * 2 + 8], xm4, 2 - vextracti128 xm4, m4, 1 - movq [r2 + r4], xm4 - pextrd [r2 + r4 + 8], xm4, 2 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_6x8 pp - FILTER_VER_CHROMA_AVX2_6x8 ps - -;----------------------------------------------------------------------------- -;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W6_H4 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_6x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m5, [r5 + r4 * 4] -%else - movd m5, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m6, m5, [tab_Vm] - pshufb m5, [tab_Vm + 16] - mova m4, [pw_512] - - mov r4d, %2 - lea r5, [3 * r1] - -.loop: - movq m0, [r0] - movq m1, [r0 + r1] - movq m2, [r0 + 2 * r1] - movq m3, [r0 + r5] - - punpcklbw m0, m1 - punpcklbw m1, m2 - punpcklbw m2, m3 - - pmaddubsw m0, m6 - pmaddubsw m7, m2, m5 - - paddw m0, m7 - - pmulhrsw m0, m4 - packuswb m0, m0 - movd [r2], m0 - pextrw [r2 + 4], m0, 2 - - lea r0, [r0 + 4 * r1] - - movq m0, [r0] - punpcklbw m3, m0 - - pmaddubsw m1, m6 - pmaddubsw m7, m3, m5 - - paddw m1, m7 - - pmulhrsw m1, m4 - packuswb m1, m1 - movd [r2 + r3], m1 - pextrw [r2 + r3 + 4], m1, 2 - - movq m1, [r0 + r1] - punpcklbw m7, m0, m1 - - pmaddubsw m2, m6 - pmaddubsw m7, m5 - - paddw m2, m7 - - pmulhrsw m2, m4 - packuswb m2, m2 - lea r2, [r2 + 2 * r3] - movd [r2], m2 - pextrw [r2 + 4], m2, 2 - - movq m2, [r0 + 2 * r1] - punpcklbw m1, m2 - - pmaddubsw m3, m6 - pmaddubsw m1, m5 - - paddw m3, m1 - - pmulhrsw m3, m4 - packuswb m3, m3 - - movd [r2 + r3], m3 - pextrw [r2 + r3 + 4], m3, 2 - - lea r2, [r2 + 2 * r3] - - sub r4, 4 - jnz .loop - RET -%endmacro - - FILTER_V4_W6_H4 6, 8 - - FILTER_V4_W6_H4 
6, 16 - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W12_H2 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_12x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - - mov r4d, %2 - -.loop: - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r0, [r0 + 2 * r1] - movu m5, [r0] - movu m7, [r0 + r1] - - punpcklbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m4, m6 - - punpckhbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m2, m6 - - mova m6, [pw_512] - - pmulhrsw m4, m6 - pmulhrsw m2, m6 - - packuswb m4, m2 - - movh [r2], m4 - pextrd [r2 + 8], m4, 2 - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m4, m1 - pmaddubsw m3, m1 - - movu m5, [r0 + 2 * r1] - - punpcklbw m2, m7, m5 - punpckhbw m7, m5 - - pmaddubsw m2, m0 - pmaddubsw m7, m0 - - paddw m4, m2 - paddw m3, m7 - - pmulhrsw m4, m6 - pmulhrsw m3, m6 - - packuswb m4, m3 - - movh [r2 + r3], m4 - pextrd [r2 + r3 + 8], m4, 2 - - lea r2, [r2 + 2 * r3] - - sub r4, 2 - jnz .loop - RET -%endmacro - - FILTER_V4_W12_H2 12, 16 - - FILTER_V4_W12_H2 12, 32 - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro FILTER_V4_W16_H2 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_16x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - 
pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - - mov r4d, %2/2 - -.loop: - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r0, [r0 + 2 * r1] - movu m5, [r0] - movu m6, [r0 + r1] - - punpckhbw m7, m5, m6 - pmaddubsw m7, m0 - paddw m2, m7 - - punpcklbw m7, m5, m6 - pmaddubsw m7, m0 - paddw m4, m7 - - mova m7, [pw_512] - - pmulhrsw m4, m7 - pmulhrsw m2, m7 - - packuswb m4, m2 - - movu [r2], m4 - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m4, m1 - pmaddubsw m3, m1 - - movu m5, [r0 + 2 * r1] - - punpcklbw m2, m6, m5 - punpckhbw m6, m5 - - pmaddubsw m2, m0 - pmaddubsw m6, m0 - - paddw m4, m2 - paddw m3, m6 - - pmulhrsw m4, m7 - pmulhrsw m3, m7 - - packuswb m4, m3 - - movu [r2 + r3], m4 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V4_W16_H2 16, 4 - FILTER_V4_W16_H2 16, 8 - FILTER_V4_W16_H2 16, 12 - FILTER_V4_W16_H2 16, 16 - FILTER_V4_W16_H2 16, 32 - - FILTER_V4_W16_H2 16, 24 - FILTER_V4_W16_H2 16, 64 - -%macro FILTER_VER_CHROMA_AVX2_16x16 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_16x16, 4, 6, 15 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - mova m12, [r5] - mova m13, [r5 + mmsize] - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m14, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m14, [pw_2000] -%endif - lea r5, [r3 * 3] - - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhbw xm2, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddubsw m0, m12 - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhbw xm3, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m1, m12 - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhbw xm4, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddubsw m4, m2, m13 - paddw m0, m4 - pmaddubsw m2, m12 - lea r0, [r0 + r1 * 4] - movu 
xm4, [r0] ; m4 = row 4 - punpckhbw xm5, xm3, xm4 - punpcklbw xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddubsw m5, m3, m13 - paddw m1, m5 - pmaddubsw m3, m12 - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhbw xm6, xm4, xm5 - punpcklbw xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddubsw m6, m4, m13 - paddw m2, m6 - pmaddubsw m4, m12 - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhbw xm7, xm5, xm6 - punpcklbw xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddubsw m7, m5, m13 - paddw m3, m7 - pmaddubsw m5, m12 - movu xm7, [r0 + r4] ; m7 = row 7 - punpckhbw xm8, xm6, xm7 - punpcklbw xm6, xm7 - vinserti128 m6, m6, xm8, 1 - pmaddubsw m8, m6, m13 - paddw m4, m8 - pmaddubsw m6, m12 - lea r0, [r0 + r1 * 4] - movu xm8, [r0] ; m8 = row 8 - punpckhbw xm9, xm7, xm8 - punpcklbw xm7, xm8 - vinserti128 m7, m7, xm9, 1 - pmaddubsw m9, m7, m13 - paddw m5, m9 - pmaddubsw m7, m12 - movu xm9, [r0 + r1] ; m9 = row 9 - punpckhbw xm10, xm8, xm9 - punpcklbw xm8, xm9 - vinserti128 m8, m8, xm10, 1 - pmaddubsw m10, m8, m13 - paddw m6, m10 - pmaddubsw m8, m12 - movu xm10, [r0 + r1 * 2] ; m10 = row 10 - punpckhbw xm11, xm9, xm10 - punpcklbw xm9, xm10 - vinserti128 m9, m9, xm11, 1 - pmaddubsw m11, m9, m13 - paddw m7, m11 - pmaddubsw m9, m12 - -%ifidn %1,pp - pmulhrsw m0, m14 ; m0 = word: row 0 - pmulhrsw m1, m14 ; m1 = word: row 1 - pmulhrsw m2, m14 ; m2 = word: row 2 - pmulhrsw m3, m14 ; m3 = word: row 3 - pmulhrsw m4, m14 ; m4 = word: row 4 - pmulhrsw m5, m14 ; m5 = word: row 5 - pmulhrsw m6, m14 ; m6 = word: row 6 - pmulhrsw m7, m14 ; m7 = word: row 7 - packuswb m0, m1 - packuswb m2, m3 - packuswb m4, m5 - packuswb m6, m7 - vpermq m0, m0, 11011000b - vpermq m2, m2, 11011000b - vpermq m4, m4, 11011000b - vpermq m6, m6, 11011000b - vextracti128 xm1, m0, 1 - vextracti128 xm3, m2, 1 - vextracti128 xm5, m4, 1 - vextracti128 xm7, m6, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r5], xm3 - lea r2, [r2 + r3 * 4] - movu [r2], xm4 - movu [r2 + r3], xm5 - movu [r2 + r3 * 2], xm6 - 
movu [r2 + r5], xm7 -%else - psubw m0, m14 ; m0 = word: row 0 - psubw m1, m14 ; m1 = word: row 1 - psubw m2, m14 ; m2 = word: row 2 - psubw m3, m14 ; m3 = word: row 3 - psubw m4, m14 ; m4 = word: row 4 - psubw m5, m14 ; m5 = word: row 5 - psubw m6, m14 ; m6 = word: row 6 - psubw m7, m14 ; m7 = word: row 7 - movu [r2], m0 - movu [r2 + r3], m1 - movu [r2 + r3 * 2], m2 - movu [r2 + r5], m3 - lea r2, [r2 + r3 * 4] - movu [r2], m4 - movu [r2 + r3], m5 - movu [r2 + r3 * 2], m6 - movu [r2 + r5], m7 -%endif - lea r2, [r2 + r3 * 4] - - movu xm11, [r0 + r4] ; m11 = row 11 - punpckhbw xm6, xm10, xm11 - punpcklbw xm10, xm11 - vinserti128 m10, m10, xm6, 1 - pmaddubsw m6, m10, m13 - paddw m8, m6 - pmaddubsw m10, m12 - lea r0, [r0 + r1 * 4] - movu xm6, [r0] ; m6 = row 12 - punpckhbw xm7, xm11, xm6 - punpcklbw xm11, xm6 - vinserti128 m11, m11, xm7, 1 - pmaddubsw m7, m11, m13 - paddw m9, m7 - pmaddubsw m11, m12 - - movu xm7, [r0 + r1] ; m7 = row 13 - punpckhbw xm0, xm6, xm7 - punpcklbw xm6, xm7 - vinserti128 m6, m6, xm0, 1 - pmaddubsw m0, m6, m13 - paddw m10, m0 - pmaddubsw m6, m12 - movu xm0, [r0 + r1 * 2] ; m0 = row 14 - punpckhbw xm1, xm7, xm0 - punpcklbw xm7, xm0 - vinserti128 m7, m7, xm1, 1 - pmaddubsw m1, m7, m13 - paddw m11, m1 - pmaddubsw m7, m12 - movu xm1, [r0 + r4] ; m1 = row 15 - punpckhbw xm2, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddubsw m2, m0, m13 - paddw m6, m2 - pmaddubsw m0, m12 - lea r0, [r0 + r1 * 4] - movu xm2, [r0] ; m2 = row 16 - punpckhbw xm3, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m3, m1, m13 - paddw m7, m3 - pmaddubsw m1, m12 - movu xm3, [r0 + r1] ; m3 = row 17 - punpckhbw xm4, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddubsw m2, m13 - paddw m0, m2 - movu xm4, [r0 + r1 * 2] ; m4 = row 18 - punpckhbw xm5, xm3, xm4 - punpcklbw xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddubsw m3, m13 - paddw m1, m3 - -%ifidn %1,pp - pmulhrsw m8, m14 ; m8 = word: row 8 - pmulhrsw m9, m14 ; m9 = 
word: row 9 - pmulhrsw m10, m14 ; m10 = word: row 10 - pmulhrsw m11, m14 ; m11 = word: row 11 - pmulhrsw m6, m14 ; m6 = word: row 12 - pmulhrsw m7, m14 ; m7 = word: row 13 - pmulhrsw m0, m14 ; m0 = word: row 14 - pmulhrsw m1, m14 ; m1 = word: row 15 - packuswb m8, m9 - packuswb m10, m11 - packuswb m6, m7 - packuswb m0, m1 - vpermq m8, m8, 11011000b - vpermq m10, m10, 11011000b - vpermq m6, m6, 11011000b - vpermq m0, m0, 11011000b - vextracti128 xm9, m8, 1 - vextracti128 xm11, m10, 1 - vextracti128 xm7, m6, 1 - vextracti128 xm1, m0, 1 - movu [r2], xm8 - movu [r2 + r3], xm9 - movu [r2 + r3 * 2], xm10 - movu [r2 + r5], xm11 - lea r2, [r2 + r3 * 4] - movu [r2], xm6 - movu [r2 + r3], xm7 - movu [r2 + r3 * 2], xm0 - movu [r2 + r5], xm1 -%else - psubw m8, m14 ; m8 = word: row 8 - psubw m9, m14 ; m9 = word: row 9 - psubw m10, m14 ; m10 = word: row 10 - psubw m11, m14 ; m11 = word: row 11 - psubw m6, m14 ; m6 = word: row 12 - psubw m7, m14 ; m7 = word: row 13 - psubw m0, m14 ; m0 = word: row 14 - psubw m1, m14 ; m1 = word: row 15 - movu [r2], m8 - movu [r2 + r3], m9 - movu [r2 + r3 * 2], m10 - movu [r2 + r5], m11 - lea r2, [r2 + r3 * 4] - movu [r2], m6 - movu [r2 + r3], m7 - movu [r2 + r3 * 2], m0 - movu [r2 + r5], m1 -%endif - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_16x16 pp - FILTER_VER_CHROMA_AVX2_16x16 ps -%macro FILTER_VER_CHROMA_AVX2_16x8 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_16x8, 4, 7, 7 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m6, [pw_512] -%else - add r3d, r3d - mova m6, [pw_2000] -%endif - lea r6, [r3 * 3] - - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhbw xm2, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddubsw m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhbw xm3, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddubsw 
m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhbw xm4, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddubsw m4, m2, [r5 + mmsize] - paddw m0, m4 - pmaddubsw m2, [r5] - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhbw xm5, xm3, xm4 - punpcklbw xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddubsw m5, m3, [r5 + mmsize] - paddw m1, m5 - pmaddubsw m3, [r5] -%ifidn %1,pp - pmulhrsw m0, m6 ; m0 = word: row 0 - pmulhrsw m1, m6 ; m1 = word: row 1 - packuswb m0, m1 - vpermq m0, m0, 11011000b - vextracti128 xm1, m0, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 -%else - psubw m0, m6 ; m0 = word: row 0 - psubw m1, m6 ; m1 = word: row 1 - movu [r2], m0 - movu [r2 + r3], m1 -%endif - - movu xm0, [r0 + r1] ; m0 = row 5 - punpckhbw xm1, xm4, xm0 - punpcklbw xm4, xm0 - vinserti128 m4, m4, xm1, 1 - pmaddubsw m1, m4, [r5 + mmsize] - paddw m2, m1 - pmaddubsw m4, [r5] - movu xm1, [r0 + r1 * 2] ; m1 = row 6 - punpckhbw xm5, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm5, 1 - pmaddubsw m5, m0, [r5 + mmsize] - paddw m3, m5 - pmaddubsw m0, [r5] -%ifidn %1,pp - pmulhrsw m2, m6 ; m2 = word: row 2 - pmulhrsw m3, m6 ; m3 = word: row 3 - packuswb m2, m3 - vpermq m2, m2, 11011000b - vextracti128 xm3, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm3 -%else - psubw m2, m6 ; m2 = word: row 2 - psubw m3, m6 ; m3 = word: row 3 - movu [r2 + r3 * 2], m2 - movu [r2 + r6], m3 -%endif - - movu xm2, [r0 + r4] ; m2 = row 7 - punpckhbw xm3, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m3, m1, [r5 + mmsize] - paddw m4, m3 - pmaddubsw m1, [r5] - lea r0, [r0 + r1 * 4] - movu xm3, [r0] ; m3 = row 8 - punpckhbw xm5, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm5, 1 - pmaddubsw m5, m2, [r5 + mmsize] - paddw m0, m5 - pmaddubsw m2, [r5] - lea r2, [r2 + r3 * 4] -%ifidn %1,pp - pmulhrsw m4, m6 ; m4 = word: row 4 - pmulhrsw m0, m6 ; m0 = word: row 5 - packuswb m4, m0 - vpermq m4, m4, 11011000b - vextracti128 xm0, m4, 1 - movu [r2], xm4 - 
movu [r2 + r3], xm0 -%else - psubw m4, m6 ; m4 = word: row 4 - psubw m0, m6 ; m0 = word: row 5 - movu [r2], m4 - movu [r2 + r3], m0 -%endif - - movu xm5, [r0 + r1] ; m5 = row 9 - punpckhbw xm4, xm3, xm5 - punpcklbw xm3, xm5 - vinserti128 m3, m3, xm4, 1 - pmaddubsw m3, [r5 + mmsize] - paddw m1, m3 - movu xm4, [r0 + r1 * 2] ; m4 = row 10 - punpckhbw xm0, xm5, xm4 - punpcklbw xm5, xm4 - vinserti128 m5, m5, xm0, 1 - pmaddubsw m5, [r5 + mmsize] - paddw m2, m5 -%ifidn %1,pp - pmulhrsw m1, m6 ; m1 = word: row 6 - pmulhrsw m2, m6 ; m2 = word: row 7 - packuswb m1, m2 - vpermq m1, m1, 11011000b - vextracti128 xm2, m1, 1 - movu [r2 + r3 * 2], xm1 - movu [r2 + r6], xm2 -%else - psubw m1, m6 ; m1 = word: row 6 - psubw m2, m6 ; m2 = word: row 7 - movu [r2 + r3 * 2], m1 - movu [r2 + r6], m2 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_16x8 pp - FILTER_VER_CHROMA_AVX2_16x8 ps - -%macro FILTER_VER_CHROMA_AVX2_16x12 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_16x12, 4, 6, 10 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - mova m8, [r5] - mova m9, [r5 + mmsize] - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m7, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m7, [pw_2000] -%endif - lea r5, [r3 * 3] - - movu xm0, [r0] - vinserti128 m0, m0, [r0 + r1 * 2], 1 - movu xm1, [r0 + r1] - vinserti128 m1, m1, [r0 + r4], 1 - - punpcklbw m2, m0, m1 - punpckhbw m3, m0, m1 - vperm2i128 m4, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - pmaddubsw m4, m8 - pmaddubsw m3, m2, m9 - paddw m4, m3 - pmaddubsw m2, m8 - - vextracti128 xm0, m0, 1 - lea r0, [r0 + r1 * 4] - vinserti128 m0, m0, [r0], 1 - - punpcklbw m5, m1, m0 - punpckhbw m3, m1, m0 - vperm2i128 m6, m5, m3, 0x20 - vperm2i128 m5, m5, m3, 0x31 - pmaddubsw m6, m8 - pmaddubsw m3, m5, m9 - paddw m6, m3 - pmaddubsw m5, m8 -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 0 - pmulhrsw m6, m7 ; m6 = word: row 1 - packuswb 
m4, m6 - vpermq m4, m4, 11011000b - vextracti128 xm6, m4, 1 - movu [r2], xm4 - movu [r2 + r3], xm6 -%else - psubw m4, m7 ; m4 = word: row 0 - psubw m6, m7 ; m6 = word: row 1 - movu [r2], m4 - movu [r2 + r3], m6 -%endif - - movu xm4, [r0 + r1 * 2] - vinserti128 m4, m4, [r0 + r1], 1 - vextracti128 xm1, m4, 1 - vinserti128 m0, m0, xm1, 0 - - punpcklbw m6, m0, m4 - punpckhbw m1, m0, m4 - vperm2i128 m0, m6, m1, 0x20 - vperm2i128 m6, m6, m1, 0x31 - pmaddubsw m1, m0, m9 - paddw m5, m1 - pmaddubsw m0, m8 - pmaddubsw m1, m6, m9 - paddw m2, m1 - pmaddubsw m6, m8 - -%ifidn %1,pp - pmulhrsw m2, m7 ; m2 = word: row 2 - pmulhrsw m5, m7 ; m5 = word: row 3 - packuswb m2, m5 - vpermq m2, m2, 11011000b - vextracti128 xm5, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r5], xm5 -%else - psubw m2, m7 ; m2 = word: row 2 - psubw m5, m7 ; m5 = word: row 3 - movu [r2 + r3 * 2], m2 - movu [r2 + r5], m5 -%endif - lea r2, [r2 + r3 * 4] - - movu xm1, [r0 + r4] - lea r0, [r0 + r1 * 4] - vinserti128 m1, m1, [r0], 1 - vinserti128 m4, m4, xm1, 1 - - punpcklbw m2, m4, m1 - punpckhbw m5, m4, m1 - vperm2i128 m3, m2, m5, 0x20 - vperm2i128 m2, m2, m5, 0x31 - pmaddubsw m5, m3, m9 - paddw m6, m5 - pmaddubsw m3, m8 - pmaddubsw m5, m2, m9 - paddw m0, m5 - pmaddubsw m2, m8 - -%ifidn %1,pp - pmulhrsw m6, m7 ; m6 = word: row 4 - pmulhrsw m0, m7 ; m0 = word: row 5 - packuswb m6, m0 - vpermq m6, m6, 11011000b - vextracti128 xm0, m6, 1 - movu [r2], xm6 - movu [r2 + r3], xm0 -%else - psubw m6, m7 ; m6 = word: row 4 - psubw m0, m7 ; m0 = word: row 5 - movu [r2], m6 - movu [r2 + r3], m0 -%endif - - movu xm6, [r0 + r1 * 2] - vinserti128 m6, m6, [r0 + r1], 1 - vextracti128 xm0, m6, 1 - vinserti128 m1, m1, xm0, 0 - - punpcklbw m4, m1, m6 - punpckhbw m5, m1, m6 - vperm2i128 m0, m4, m5, 0x20 - vperm2i128 m5, m4, m5, 0x31 - pmaddubsw m4, m0, m9 - paddw m2, m4 - pmaddubsw m0, m8 - pmaddubsw m4, m5, m9 - paddw m3, m4 - pmaddubsw m5, m8 - -%ifidn %1,pp - pmulhrsw m3, m7 ; m3 = word: row 6 - pmulhrsw m2, m7 ; m2 = word: row 7 
- packuswb m3, m2 - vpermq m3, m3, 11011000b - vextracti128 xm2, m3, 1 - movu [r2 + r3 * 2], xm3 - movu [r2 + r5], xm2 -%else - psubw m3, m7 ; m3 = word: row 6 - psubw m2, m7 ; m2 = word: row 7 - movu [r2 + r3 * 2], m3 - movu [r2 + r5], m2 -%endif - lea r2, [r2 + r3 * 4] - - movu xm3, [r0 + r4] - lea r0, [r0 + r1 * 4] - vinserti128 m3, m3, [r0], 1 - vinserti128 m6, m6, xm3, 1 - - punpcklbw m2, m6, m3 - punpckhbw m1, m6, m3 - vperm2i128 m4, m2, m1, 0x20 - vperm2i128 m2, m2, m1, 0x31 - pmaddubsw m1, m4, m9 - paddw m5, m1 - pmaddubsw m4, m8 - pmaddubsw m1, m2, m9 - paddw m0, m1 - pmaddubsw m2, m8 - -%ifidn %1,pp - pmulhrsw m5, m7 ; m5 = word: row 8 - pmulhrsw m0, m7 ; m0 = word: row 9 - packuswb m5, m0 - vpermq m5, m5, 11011000b - vextracti128 xm0, m5, 1 - movu [r2], xm5 - movu [r2 + r3], xm0 -%else - psubw m5, m7 ; m5 = word: row 8 - psubw m0, m7 ; m0 = word: row 9 - movu [r2], m5 - movu [r2 + r3], m0 -%endif - - movu xm5, [r0 + r1 * 2] - vinserti128 m5, m5, [r0 + r1], 1 - vextracti128 xm0, m5, 1 - vinserti128 m3, m3, xm0, 0 - - punpcklbw m1, m3, m5 - punpckhbw m0, m3, m5 - vperm2i128 m6, m1, m0, 0x20 - vperm2i128 m0, m1, m0, 0x31 - pmaddubsw m1, m6, m9 - paddw m2, m1 - pmaddubsw m1, m0, m9 - paddw m4, m1 - -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 10 - pmulhrsw m2, m7 ; m2 = word: row 11 - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm2, m4, 1 - movu [r2 + r3 * 2], xm4 - movu [r2 + r5], xm2 -%else - psubw m4, m7 ; m4 = word: row 10 - psubw m2, m7 ; m2 = word: row 11 - movu [r2 + r3 * 2], m4 - movu [r2 + r5], m2 -%endif - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_16x12 pp - FILTER_VER_CHROMA_AVX2_16x12 ps - -%macro FILTER_VER_CHROMA_AVX2_16xN 2 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_16x%2, 4, 8, 8 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m7, [pw_512] -%else 
- add r3d, r3d - mova m7, [pw_2000] -%endif - lea r6, [r3 * 3] - mov r7d, %2 / 16 -.loopH: - movu xm0, [r0] - vinserti128 m0, m0, [r0 + r1 * 2], 1 - movu xm1, [r0 + r1] - vinserti128 m1, m1, [r0 + r4], 1 - - punpcklbw m2, m0, m1 - punpckhbw m3, m0, m1 - vperm2i128 m4, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - pmaddubsw m4, [r5] - pmaddubsw m3, m2, [r5 + mmsize] - paddw m4, m3 - pmaddubsw m2, [r5] - - vextracti128 xm0, m0, 1 - lea r0, [r0 + r1 * 4] - vinserti128 m0, m0, [r0], 1 - - punpcklbw m5, m1, m0 - punpckhbw m3, m1, m0 - vperm2i128 m6, m5, m3, 0x20 - vperm2i128 m5, m5, m3, 0x31 - pmaddubsw m6, [r5] - pmaddubsw m3, m5, [r5 + mmsize] - paddw m6, m3 - pmaddubsw m5, [r5] -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 0 - pmulhrsw m6, m7 ; m6 = word: row 1 - packuswb m4, m6 - vpermq m4, m4, 11011000b - vextracti128 xm6, m4, 1 - movu [r2], xm4 - movu [r2 + r3], xm6 -%else - psubw m4, m7 ; m4 = word: row 0 - psubw m6, m7 ; m6 = word: row 1 - movu [r2], m4 - movu [r2 + r3], m6 -%endif - - movu xm4, [r0 + r1 * 2] - vinserti128 m4, m4, [r0 + r1], 1 - vextracti128 xm1, m4, 1 - vinserti128 m0, m0, xm1, 0 - - punpcklbw m6, m0, m4 - punpckhbw m1, m0, m4 - vperm2i128 m0, m6, m1, 0x20 - vperm2i128 m6, m6, m1, 0x31 - pmaddubsw m1, m0, [r5 + mmsize] - paddw m5, m1 - pmaddubsw m0, [r5] - pmaddubsw m1, m6, [r5 + mmsize] - paddw m2, m1 - pmaddubsw m6, [r5] - -%ifidn %1,pp - pmulhrsw m2, m7 ; m2 = word: row 2 - pmulhrsw m5, m7 ; m5 = word: row 3 - packuswb m2, m5 - vpermq m2, m2, 11011000b - vextracti128 xm5, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm5 -%else - psubw m2, m7 ; m2 = word: row 2 - psubw m5, m7 ; m5 = word: row 3 - movu [r2 + r3 * 2], m2 - movu [r2 + r6], m5 -%endif - lea r2, [r2 + r3 * 4] - - movu xm1, [r0 + r4] - lea r0, [r0 + r1 * 4] - vinserti128 m1, m1, [r0], 1 - vinserti128 m4, m4, xm1, 1 - - punpcklbw m2, m4, m1 - punpckhbw m5, m4, m1 - vperm2i128 m3, m2, m5, 0x20 - vperm2i128 m2, m2, m5, 0x31 - pmaddubsw m5, m3, [r5 + mmsize] - paddw m6, m5 - 
pmaddubsw m3, [r5] - pmaddubsw m5, m2, [r5 + mmsize] - paddw m0, m5 - pmaddubsw m2, [r5] - -%ifidn %1,pp - pmulhrsw m6, m7 ; m6 = word: row 4 - pmulhrsw m0, m7 ; m0 = word: row 5 - packuswb m6, m0 - vpermq m6, m6, 11011000b - vextracti128 xm0, m6, 1 - movu [r2], xm6 - movu [r2 + r3], xm0 -%else - psubw m6, m7 ; m6 = word: row 4 - psubw m0, m7 ; m0 = word: row 5 - movu [r2], m6 - movu [r2 + r3], m0 -%endif - - movu xm6, [r0 + r1 * 2] - vinserti128 m6, m6, [r0 + r1], 1 - vextracti128 xm0, m6, 1 - vinserti128 m1, m1, xm0, 0 - - punpcklbw m4, m1, m6 - punpckhbw m5, m1, m6 - vperm2i128 m0, m4, m5, 0x20 - vperm2i128 m5, m4, m5, 0x31 - pmaddubsw m4, m0, [r5 + mmsize] - paddw m2, m4 - pmaddubsw m0, [r5] - pmaddubsw m4, m5, [r5 + mmsize] - paddw m3, m4 - pmaddubsw m5, [r5] - -%ifidn %1,pp - pmulhrsw m3, m7 ; m3 = word: row 6 - pmulhrsw m2, m7 ; m2 = word: row 7 - packuswb m3, m2 - vpermq m3, m3, 11011000b - vextracti128 xm2, m3, 1 - movu [r2 + r3 * 2], xm3 - movu [r2 + r6], xm2 -%else - psubw m3, m7 ; m3 = word: row 6 - psubw m2, m7 ; m2 = word: row 7 - movu [r2 + r3 * 2], m3 - movu [r2 + r6], m2 -%endif - lea r2, [r2 + r3 * 4] - - movu xm3, [r0 + r4] - lea r0, [r0 + r1 * 4] - vinserti128 m3, m3, [r0], 1 - vinserti128 m6, m6, xm3, 1 - - punpcklbw m2, m6, m3 - punpckhbw m1, m6, m3 - vperm2i128 m4, m2, m1, 0x20 - vperm2i128 m2, m2, m1, 0x31 - pmaddubsw m1, m4, [r5 + mmsize] - paddw m5, m1 - pmaddubsw m4, [r5] - pmaddubsw m1, m2, [r5 + mmsize] - paddw m0, m1 - pmaddubsw m2, [r5] - -%ifidn %1,pp - pmulhrsw m5, m7 ; m5 = word: row 8 - pmulhrsw m0, m7 ; m0 = word: row 9 - packuswb m5, m0 - vpermq m5, m5, 11011000b - vextracti128 xm0, m5, 1 - movu [r2], xm5 - movu [r2 + r3], xm0 -%else - psubw m5, m7 ; m5 = word: row 8 - psubw m0, m7 ; m0 = word: row 9 - movu [r2], m5 - movu [r2 + r3], m0 -%endif - - movu xm5, [r0 + r1 * 2] - vinserti128 m5, m5, [r0 + r1], 1 - vextracti128 xm0, m5, 1 - vinserti128 m3, m3, xm0, 0 - - punpcklbw m1, m3, m5 - punpckhbw m0, m3, m5 - vperm2i128 m6, m1, 
m0, 0x20 - vperm2i128 m0, m1, m0, 0x31 - pmaddubsw m1, m6, [r5 + mmsize] - paddw m2, m1 - pmaddubsw m6, [r5] - pmaddubsw m1, m0, [r5 + mmsize] - paddw m4, m1 - pmaddubsw m0, [r5] - -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 10 - pmulhrsw m2, m7 ; m2 = word: row 11 - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm2, m4, 1 - movu [r2 + r3 * 2], xm4 - movu [r2 + r6], xm2 -%else - psubw m4, m7 ; m4 = word: row 10 - psubw m2, m7 ; m2 = word: row 11 - movu [r2 + r3 * 2], m4 - movu [r2 + r6], m2 -%endif - lea r2, [r2 + r3 * 4] - - movu xm3, [r0 + r4] - lea r0, [r0 + r1 * 4] - vinserti128 m3, m3, [r0], 1 - vinserti128 m5, m5, xm3, 1 - - punpcklbw m2, m5, m3 - punpckhbw m1, m5, m3 - vperm2i128 m4, m2, m1, 0x20 - vperm2i128 m2, m2, m1, 0x31 - pmaddubsw m1, m4, [r5 + mmsize] - paddw m0, m1 - pmaddubsw m4, [r5] - pmaddubsw m1, m2, [r5 + mmsize] - paddw m6, m1 - pmaddubsw m2, [r5] - -%ifidn %1,pp - pmulhrsw m0, m7 ; m0 = word: row 12 - pmulhrsw m6, m7 ; m6 = word: row 13 - packuswb m0, m6 - vpermq m0, m0, 11011000b - vextracti128 xm6, m0, 1 - movu [r2], xm0 - movu [r2 + r3], xm6 -%else - psubw m0, m7 ; m0 = word: row 12 - psubw m6, m7 ; m6 = word: row 13 - movu [r2], m0 - movu [r2 + r3], m6 -%endif - - movu xm5, [r0 + r1 * 2] - vinserti128 m5, m5, [r0 + r1], 1 - vextracti128 xm0, m5, 1 - vinserti128 m3, m3, xm0, 0 - - punpcklbw m1, m3, m5 - punpckhbw m0, m3, m5 - vperm2i128 m6, m1, m0, 0x20 - vperm2i128 m0, m1, m0, 0x31 - pmaddubsw m6, [r5 + mmsize] - paddw m2, m6 - pmaddubsw m0, [r5 + mmsize] - paddw m4, m0 - -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 14 - pmulhrsw m2, m7 ; m2 = word: row 15 - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm2, m4, 1 - movu [r2 + r3 * 2], xm4 - movu [r2 + r6], xm2 -%else - psubw m4, m7 ; m4 = word: row 14 - psubw m2, m7 ; m2 = word: row 15 - movu [r2 + r3 * 2], m4 - movu [r2 + r6], m2 -%endif - lea r2, [r2 + r3 * 4] - dec r7d - jnz .loopH - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_16xN pp, 32 - 
FILTER_VER_CHROMA_AVX2_16xN ps, 32 - FILTER_VER_CHROMA_AVX2_16xN pp, 64 - FILTER_VER_CHROMA_AVX2_16xN ps, 64 - -%macro FILTER_VER_CHROMA_AVX2_16x24 1 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_16x24, 4, 6, 15 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - mova m12, [r5] - mova m13, [r5 + mmsize] - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m14, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m14, [pw_2000] -%endif - lea r5, [r3 * 3] - - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhbw xm2, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddubsw m0, m12 - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhbw xm3, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m1, m12 - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhbw xm4, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddubsw m4, m2, m13 - paddw m0, m4 - pmaddubsw m2, m12 - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhbw xm5, xm3, xm4 - punpcklbw xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddubsw m5, m3, m13 - paddw m1, m5 - pmaddubsw m3, m12 - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhbw xm6, xm4, xm5 - punpcklbw xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddubsw m6, m4, m13 - paddw m2, m6 - pmaddubsw m4, m12 - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhbw xm7, xm5, xm6 - punpcklbw xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddubsw m7, m5, m13 - paddw m3, m7 - pmaddubsw m5, m12 - movu xm7, [r0 + r4] ; m7 = row 7 - punpckhbw xm8, xm6, xm7 - punpcklbw xm6, xm7 - vinserti128 m6, m6, xm8, 1 - pmaddubsw m8, m6, m13 - paddw m4, m8 - pmaddubsw m6, m12 - lea r0, [r0 + r1 * 4] - movu xm8, [r0] ; m8 = row 8 - punpckhbw xm9, xm7, xm8 - punpcklbw xm7, xm8 - vinserti128 m7, m7, xm9, 1 - pmaddubsw m9, m7, m13 - paddw m5, m9 - pmaddubsw m7, m12 - movu xm9, [r0 + r1] ; m9 = row 9 - punpckhbw xm10, xm8, xm9 - 
punpcklbw xm8, xm9 - vinserti128 m8, m8, xm10, 1 - pmaddubsw m10, m8, m13 - paddw m6, m10 - pmaddubsw m8, m12 - movu xm10, [r0 + r1 * 2] ; m10 = row 10 - punpckhbw xm11, xm9, xm10 - punpcklbw xm9, xm10 - vinserti128 m9, m9, xm11, 1 - pmaddubsw m11, m9, m13 - paddw m7, m11 - pmaddubsw m9, m12 - -%ifidn %1,pp - pmulhrsw m0, m14 ; m0 = word: row 0 - pmulhrsw m1, m14 ; m1 = word: row 1 - pmulhrsw m2, m14 ; m2 = word: row 2 - pmulhrsw m3, m14 ; m3 = word: row 3 - pmulhrsw m4, m14 ; m4 = word: row 4 - pmulhrsw m5, m14 ; m5 = word: row 5 - pmulhrsw m6, m14 ; m6 = word: row 6 - pmulhrsw m7, m14 ; m7 = word: row 7 - packuswb m0, m1 - packuswb m2, m3 - packuswb m4, m5 - packuswb m6, m7 - vpermq m0, m0, q3120 - vpermq m2, m2, q3120 - vpermq m4, m4, q3120 - vpermq m6, m6, q3120 - vextracti128 xm1, m0, 1 - vextracti128 xm3, m2, 1 - vextracti128 xm5, m4, 1 - vextracti128 xm7, m6, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r5], xm3 - lea r2, [r2 + r3 * 4] - movu [r2], xm4 - movu [r2 + r3], xm5 - movu [r2 + r3 * 2], xm6 - movu [r2 + r5], xm7 -%else - psubw m0, m14 ; m0 = word: row 0 - psubw m1, m14 ; m1 = word: row 1 - psubw m2, m14 ; m2 = word: row 2 - psubw m3, m14 ; m3 = word: row 3 - psubw m4, m14 ; m4 = word: row 4 - psubw m5, m14 ; m5 = word: row 5 - psubw m6, m14 ; m6 = word: row 6 - psubw m7, m14 ; m7 = word: row 7 - movu [r2], m0 - movu [r2 + r3], m1 - movu [r2 + r3 * 2], m2 - movu [r2 + r5], m3 - lea r2, [r2 + r3 * 4] - movu [r2], m4 - movu [r2 + r3], m5 - movu [r2 + r3 * 2], m6 - movu [r2 + r5], m7 -%endif - lea r2, [r2 + r3 * 4] - - movu xm11, [r0 + r4] ; m11 = row 11 - punpckhbw xm6, xm10, xm11 - punpcklbw xm10, xm11 - vinserti128 m10, m10, xm6, 1 - pmaddubsw m6, m10, m13 - paddw m8, m6 - pmaddubsw m10, m12 - lea r0, [r0 + r1 * 4] - movu xm6, [r0] ; m6 = row 12 - punpckhbw xm7, xm11, xm6 - punpcklbw xm11, xm6 - vinserti128 m11, m11, xm7, 1 - pmaddubsw m7, m11, m13 - paddw m9, m7 - pmaddubsw m11, m12 - - movu xm7, [r0 + r1] ; m7 = 
row 13 - punpckhbw xm0, xm6, xm7 - punpcklbw xm6, xm7 - vinserti128 m6, m6, xm0, 1 - pmaddubsw m0, m6, m13 - paddw m10, m0 - pmaddubsw m6, m12 - movu xm0, [r0 + r1 * 2] ; m0 = row 14 - punpckhbw xm1, xm7, xm0 - punpcklbw xm7, xm0 - vinserti128 m7, m7, xm1, 1 - pmaddubsw m1, m7, m13 - paddw m11, m1 - pmaddubsw m7, m12 - movu xm1, [r0 + r4] ; m1 = row 15 - punpckhbw xm2, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddubsw m2, m0, m13 - paddw m6, m2 - pmaddubsw m0, m12 - lea r0, [r0 + r1 * 4] - movu xm2, [r0] ; m2 = row 16 - punpckhbw xm3, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m3, m1, m13 - paddw m7, m3 - pmaddubsw m1, m12 - movu xm3, [r0 + r1] ; m3 = row 17 - punpckhbw xm4, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddubsw m4, m2, m13 - paddw m0, m4 - pmaddubsw m2, m12 - movu xm4, [r0 + r1 * 2] ; m4 = row 18 - punpckhbw xm5, xm3, xm4 - punpcklbw xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddubsw m5, m3, m13 - paddw m1, m5 - pmaddubsw m3, m12 - -%ifidn %1,pp - pmulhrsw m8, m14 ; m8 = word: row 8 - pmulhrsw m9, m14 ; m9 = word: row 9 - pmulhrsw m10, m14 ; m10 = word: row 10 - pmulhrsw m11, m14 ; m11 = word: row 11 - pmulhrsw m6, m14 ; m6 = word: row 12 - pmulhrsw m7, m14 ; m7 = word: row 13 - pmulhrsw m0, m14 ; m0 = word: row 14 - pmulhrsw m1, m14 ; m1 = word: row 15 - packuswb m8, m9 - packuswb m10, m11 - packuswb m6, m7 - packuswb m0, m1 - vpermq m8, m8, q3120 - vpermq m10, m10, q3120 - vpermq m6, m6, q3120 - vpermq m0, m0, q3120 - vextracti128 xm9, m8, 1 - vextracti128 xm11, m10, 1 - vextracti128 xm7, m6, 1 - vextracti128 xm1, m0, 1 - movu [r2], xm8 - movu [r2 + r3], xm9 - movu [r2 + r3 * 2], xm10 - movu [r2 + r5], xm11 - lea r2, [r2 + r3 * 4] - movu [r2], xm6 - movu [r2 + r3], xm7 - movu [r2 + r3 * 2], xm0 - movu [r2 + r5], xm1 -%else - psubw m8, m14 ; m8 = word: row 8 - psubw m9, m14 ; m9 = word: row 9 - psubw m10, m14 ; m10 = word: row 10 - psubw m11, m14 ; m11 = word: row 11 - psubw m6, m14 
; m6 = word: row 12 - psubw m7, m14 ; m7 = word: row 13 - psubw m0, m14 ; m0 = word: row 14 - psubw m1, m14 ; m1 = word: row 15 - movu [r2], m8 - movu [r2 + r3], m9 - movu [r2 + r3 * 2], m10 - movu [r2 + r5], m11 - lea r2, [r2 + r3 * 4] - movu [r2], m6 - movu [r2 + r3], m7 - movu [r2 + r3 * 2], m0 - movu [r2 + r5], m1 -%endif - lea r2, [r2 + r3 * 4] - - movu xm5, [r0 + r4] ; m5 = row 19 - punpckhbw xm6, xm4, xm5 - punpcklbw xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddubsw m6, m4, m13 - paddw m2, m6 - pmaddubsw m4, m12 - lea r0, [r0 + r1 * 4] - movu xm6, [r0] ; m6 = row 20 - punpckhbw xm7, xm5, xm6 - punpcklbw xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddubsw m7, m5, m13 - paddw m3, m7 - pmaddubsw m5, m12 - movu xm7, [r0 + r1] ; m7 = row 21 - punpckhbw xm0, xm6, xm7 - punpcklbw xm6, xm7 - vinserti128 m6, m6, xm0, 1 - pmaddubsw m0, m6, m13 - paddw m4, m0 - pmaddubsw m6, m12 - movu xm0, [r0 + r1 * 2] ; m0 = row 22 - punpckhbw xm1, xm7, xm0 - punpcklbw xm7, xm0 - vinserti128 m7, m7, xm1, 1 - pmaddubsw m1, m7, m13 - paddw m5, m1 - pmaddubsw m7, m12 - movu xm1, [r0 + r4] ; m1 = row 23 - punpckhbw xm8, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm8, 1 - pmaddubsw m8, m0, m13 - paddw m6, m8 - pmaddubsw m0, m12 - lea r0, [r0 + r1 * 4] - movu xm8, [r0] ; m8 = row 24 - punpckhbw xm9, xm1, xm8 - punpcklbw xm1, xm8 - vinserti128 m1, m1, xm9, 1 - pmaddubsw m9, m1, m13 - paddw m7, m9 - pmaddubsw m1, m12 - movu xm9, [r0 + r1] ; m9 = row 25 - punpckhbw xm10, xm8, xm9 - punpcklbw xm8, xm9 - vinserti128 m8, m8, xm10, 1 - pmaddubsw m8, m13 - paddw m0, m8 - movu xm10, [r0 + r1 * 2] ; m10 = row 26 - punpckhbw xm11, xm9, xm10 - punpcklbw xm9, xm10 - vinserti128 m9, m9, xm11, 1 - pmaddubsw m9, m13 - paddw m1, m9 - -%ifidn %1,pp - pmulhrsw m2, m14 ; m2 = word: row 16 - pmulhrsw m3, m14 ; m3 = word: row 17 - pmulhrsw m4, m14 ; m4 = word: row 18 - pmulhrsw m5, m14 ; m5 = word: row 19 - pmulhrsw m6, m14 ; m6 = word: row 20 - pmulhrsw m7, m14 ; m7 = word: row 21 - pmulhrsw m0, m14 ; 
m0 = word: row 22 - pmulhrsw m1, m14 ; m1 = word: row 23 - packuswb m2, m3 - packuswb m4, m5 - packuswb m6, m7 - packuswb m0, m1 - vpermq m2, m2, q3120 - vpermq m4, m4, q3120 - vpermq m6, m6, q3120 - vpermq m0, m0, q3120 - vextracti128 xm3, m2, 1 - vextracti128 xm5, m4, 1 - vextracti128 xm7, m6, 1 - vextracti128 xm1, m0, 1 - movu [r2], xm2 - movu [r2 + r3], xm3 - movu [r2 + r3 * 2], xm4 - movu [r2 + r5], xm5 - lea r2, [r2 + r3 * 4] - movu [r2], xm6 - movu [r2 + r3], xm7 - movu [r2 + r3 * 2], xm0 - movu [r2 + r5], xm1 -%else - psubw m2, m14 ; m2 = word: row 16 - psubw m3, m14 ; m3 = word: row 17 - psubw m4, m14 ; m4 = word: row 18 - psubw m5, m14 ; m5 = word: row 19 - psubw m6, m14 ; m6 = word: row 20 - psubw m7, m14 ; m7 = word: row 21 - psubw m0, m14 ; m0 = word: row 22 - psubw m1, m14 ; m1 = word: row 23 - movu [r2], m2 - movu [r2 + r3], m3 - movu [r2 + r3 * 2], m4 - movu [r2 + r5], m5 - lea r2, [r2 + r3 * 4] - movu [r2], m6 - movu [r2 + r3], m7 - movu [r2 + r3 * 2], m0 - movu [r2 + r5], m1 -%endif - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_16x24 pp - FILTER_VER_CHROMA_AVX2_16x24 ps - -%macro FILTER_VER_CHROMA_AVX2_24x32 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_24x32, 4, 9, 10 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - mova m8, [r5] - mova m9, [r5 + mmsize] - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m7, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m7, [pw_2000] -%endif - lea r6, [r3 * 3] - mov r5d, 2 -.loopH: - movu xm0, [r0] - vinserti128 m0, m0, [r0 + r1 * 2], 1 - movu xm1, [r0 + r1] - vinserti128 m1, m1, [r0 + r4], 1 - - punpcklbw m2, m0, m1 - punpckhbw m3, m0, m1 - vperm2i128 m4, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - pmaddubsw m4, m8 - pmaddubsw m3, m2, m9 - paddw m4, m3 - pmaddubsw m2, m8 - - vextracti128 xm0, m0, 1 - lea r7, [r0 + r1 * 4] - vinserti128 m0, m0, [r7], 1 - - punpcklbw m5, m1, m0 - 
punpckhbw m3, m1, m0 - vperm2i128 m6, m5, m3, 0x20 - vperm2i128 m5, m5, m3, 0x31 - pmaddubsw m6, m8 - pmaddubsw m3, m5, m9 - paddw m6, m3 - pmaddubsw m5, m8 -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 0 - pmulhrsw m6, m7 ; m6 = word: row 1 - packuswb m4, m6 - vpermq m4, m4, 11011000b - vextracti128 xm6, m4, 1 - movu [r2], xm4 - movu [r2 + r3], xm6 -%else - psubw m4, m7 ; m4 = word: row 0 - psubw m6, m7 ; m6 = word: row 1 - movu [r2], m4 - movu [r2 + r3], m6 -%endif - - movu xm4, [r7 + r1 * 2] - vinserti128 m4, m4, [r7 + r1], 1 - vextracti128 xm1, m4, 1 - vinserti128 m0, m0, xm1, 0 - - punpcklbw m6, m0, m4 - punpckhbw m1, m0, m4 - vperm2i128 m0, m6, m1, 0x20 - vperm2i128 m6, m6, m1, 0x31 - pmaddubsw m1, m0, m9 - paddw m5, m1 - pmaddubsw m0, m8 - pmaddubsw m1, m6, m9 - paddw m2, m1 - pmaddubsw m6, m8 - -%ifidn %1,pp - pmulhrsw m2, m7 ; m2 = word: row 2 - pmulhrsw m5, m7 ; m5 = word: row 3 - packuswb m2, m5 - vpermq m2, m2, 11011000b - vextracti128 xm5, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm5 -%else - psubw m2, m7 ; m2 = word: row 2 - psubw m5, m7 ; m5 = word: row 3 - movu [r2 + r3 * 2], m2 - movu [r2 + r6], m5 -%endif - lea r8, [r2 + r3 * 4] - - movu xm1, [r7 + r4] - lea r7, [r7 + r1 * 4] - vinserti128 m1, m1, [r7], 1 - vinserti128 m4, m4, xm1, 1 - - punpcklbw m2, m4, m1 - punpckhbw m5, m4, m1 - vperm2i128 m3, m2, m5, 0x20 - vperm2i128 m2, m2, m5, 0x31 - pmaddubsw m5, m3, m9 - paddw m6, m5 - pmaddubsw m3, m8 - pmaddubsw m5, m2, m9 - paddw m0, m5 - pmaddubsw m2, m8 - -%ifidn %1,pp - pmulhrsw m6, m7 ; m6 = word: row 4 - pmulhrsw m0, m7 ; m0 = word: row 5 - packuswb m6, m0 - vpermq m6, m6, 11011000b - vextracti128 xm0, m6, 1 - movu [r8], xm6 - movu [r8 + r3], xm0 -%else - psubw m6, m7 ; m6 = word: row 4 - psubw m0, m7 ; m0 = word: row 5 - movu [r8], m6 - movu [r8 + r3], m0 -%endif - - movu xm6, [r7 + r1 * 2] - vinserti128 m6, m6, [r7 + r1], 1 - vextracti128 xm0, m6, 1 - vinserti128 m1, m1, xm0, 0 - - punpcklbw m4, m1, m6 - punpckhbw m5, m1, m6 - 
vperm2i128 m0, m4, m5, 0x20 - vperm2i128 m5, m4, m5, 0x31 - pmaddubsw m4, m0, m9 - paddw m2, m4 - pmaddubsw m0, m8 - pmaddubsw m4, m5, m9 - paddw m3, m4 - pmaddubsw m5, m8 - -%ifidn %1,pp - pmulhrsw m3, m7 ; m3 = word: row 6 - pmulhrsw m2, m7 ; m2 = word: row 7 - packuswb m3, m2 - vpermq m3, m3, 11011000b - vextracti128 xm2, m3, 1 - movu [r8 + r3 * 2], xm3 - movu [r8 + r6], xm2 -%else - psubw m3, m7 ; m3 = word: row 6 - psubw m2, m7 ; m2 = word: row 7 - movu [r8 + r3 * 2], m3 - movu [r8 + r6], m2 -%endif - lea r8, [r8 + r3 * 4] - - movu xm3, [r7 + r4] - lea r7, [r7 + r1 * 4] - vinserti128 m3, m3, [r7], 1 - vinserti128 m6, m6, xm3, 1 - - punpcklbw m2, m6, m3 - punpckhbw m1, m6, m3 - vperm2i128 m4, m2, m1, 0x20 - vperm2i128 m2, m2, m1, 0x31 - pmaddubsw m1, m4, m9 - paddw m5, m1 - pmaddubsw m4, m8 - pmaddubsw m1, m2, m9 - paddw m0, m1 - pmaddubsw m2, m8 - -%ifidn %1,pp - pmulhrsw m5, m7 ; m5 = word: row 8 - pmulhrsw m0, m7 ; m0 = word: row 9 - packuswb m5, m0 - vpermq m5, m5, 11011000b - vextracti128 xm0, m5, 1 - movu [r8], xm5 - movu [r8 + r3], xm0 -%else - psubw m5, m7 ; m5 = word: row 8 - psubw m0, m7 ; m0 = word: row 9 - movu [r8], m5 - movu [r8 + r3], m0 -%endif - - movu xm5, [r7 + r1 * 2] - vinserti128 m5, m5, [r7 + r1], 1 - vextracti128 xm0, m5, 1 - vinserti128 m3, m3, xm0, 0 - - punpcklbw m1, m3, m5 - punpckhbw m0, m3, m5 - vperm2i128 m6, m1, m0, 0x20 - vperm2i128 m0, m1, m0, 0x31 - pmaddubsw m1, m6, m9 - paddw m2, m1 - pmaddubsw m6, m8 - pmaddubsw m1, m0, m9 - paddw m4, m1 - pmaddubsw m0, m8 - -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 10 - pmulhrsw m2, m7 ; m2 = word: row 11 - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm2, m4, 1 - movu [r8 + r3 * 2], xm4 - movu [r8 + r6], xm2 -%else - psubw m4, m7 ; m4 = word: row 10 - psubw m2, m7 ; m2 = word: row 11 - movu [r8 + r3 * 2], m4 - movu [r8 + r6], m2 -%endif - lea r8, [r8 + r3 * 4] - - movu xm3, [r7 + r4] - lea r7, [r7 + r1 * 4] - vinserti128 m3, m3, [r7], 1 - vinserti128 m5, m5, xm3, 1 - - 
punpcklbw m2, m5, m3 - punpckhbw m1, m5, m3 - vperm2i128 m4, m2, m1, 0x20 - vperm2i128 m2, m2, m1, 0x31 - pmaddubsw m1, m4, m9 - paddw m0, m1 - pmaddubsw m4, m8 - pmaddubsw m1, m2, m9 - paddw m6, m1 - pmaddubsw m2, m8 - -%ifidn %1,pp - pmulhrsw m0, m7 ; m0 = word: row 12 - pmulhrsw m6, m7 ; m6 = word: row 13 - packuswb m0, m6 - vpermq m0, m0, 11011000b - vextracti128 xm6, m0, 1 - movu [r8], xm0 - movu [r8 + r3], xm6 -%else - psubw m0, m7 ; m0 = word: row 12 - psubw m6, m7 ; m6 = word: row 13 - movu [r8], m0 - movu [r8 + r3], m6 -%endif - - movu xm5, [r7 + r1 * 2] - vinserti128 m5, m5, [r7 + r1], 1 - vextracti128 xm0, m5, 1 - vinserti128 m3, m3, xm0, 0 - - punpcklbw m1, m3, m5 - punpckhbw m0, m3, m5 - vperm2i128 m6, m1, m0, 0x20 - vperm2i128 m0, m1, m0, 0x31 - pmaddubsw m6, m9 - paddw m2, m6 - pmaddubsw m0, m9 - paddw m4, m0 - -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 14 - pmulhrsw m2, m7 ; m2 = word: row 15 - packuswb m4, m2 - vpermq m4, m4, 11011000b - vextracti128 xm2, m4, 1 - movu [r8 + r3 * 2], xm4 - movu [r8 + r6], xm2 - add r2, 16 -%else - psubw m4, m7 ; m4 = word: row 14 - psubw m2, m7 ; m2 = word: row 15 - movu [r8 + r3 * 2], m4 - movu [r8 + r6], m2 - add r2, 32 -%endif - add r0, 16 - movq xm1, [r0] ; m1 = row 0 - movq xm2, [r0 + r1] ; m2 = row 1 - punpcklbw xm1, xm2 - movq xm3, [r0 + r1 * 2] ; m3 = row 2 - punpcklbw xm2, xm3 - vinserti128 m5, m1, xm2, 1 - pmaddubsw m5, m8 - movq xm4, [r0 + r4] ; m4 = row 3 - punpcklbw xm3, xm4 - lea r7, [r0 + r1 * 4] - movq xm1, [r7] ; m1 = row 4 - punpcklbw xm4, xm1 - vinserti128 m2, m3, xm4, 1 - pmaddubsw m0, m2, m9 - paddw m5, m0 - pmaddubsw m2, m8 - movq xm3, [r7 + r1] ; m3 = row 5 - punpcklbw xm1, xm3 - movq xm4, [r7 + r1 * 2] ; m4 = row 6 - punpcklbw xm3, xm4 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m0, m1, m9 - paddw m2, m0 - pmaddubsw m1, m8 - movq xm3, [r7 + r4] ; m3 = row 7 - punpcklbw xm4, xm3 - lea r7, [r7 + r1 * 4] - movq xm0, [r7] ; m0 = row 8 - punpcklbw xm3, xm0 - vinserti128 m4, m4, xm3, 1 - 
pmaddubsw m3, m4, m9 - paddw m1, m3 - pmaddubsw m4, m8 - movq xm3, [r7 + r1] ; m3 = row 9 - punpcklbw xm0, xm3 - movq xm6, [r7 + r1 * 2] ; m6 = row 10 - punpcklbw xm3, xm6 - vinserti128 m0, m0, xm3, 1 - pmaddubsw m3, m0, m9 - paddw m4, m3 - pmaddubsw m0, m8 - -%ifidn %1,pp - pmulhrsw m5, m7 ; m5 = word: row 0, row 1 - pmulhrsw m2, m7 ; m2 = word: row 2, row 3 - pmulhrsw m1, m7 ; m1 = word: row 4, row 5 - pmulhrsw m4, m7 ; m4 = word: row 6, row 7 - packuswb m5, m2 - packuswb m1, m4 - vextracti128 xm2, m5, 1 - vextracti128 xm4, m1, 1 - movq [r2], xm5 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm5 - movhps [r2 + r6], xm2 - lea r8, [r2 + r3 * 4] - movq [r8], xm1 - movq [r8 + r3], xm4 - movhps [r8 + r3 * 2], xm1 - movhps [r8 + r6], xm4 -%else - psubw m5, m7 ; m5 = word: row 0, row 1 - psubw m2, m7 ; m2 = word: row 2, row 3 - psubw m1, m7 ; m1 = word: row 4, row 5 - psubw m4, m7 ; m4 = word: row 6, row 7 - vextracti128 xm3, m5, 1 - movu [r2], xm5 - movu [r2 + r3], xm3 - vextracti128 xm3, m2, 1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm3 - vextracti128 xm3, m1, 1 - lea r8, [r2 + r3 * 4] - movu [r8], xm1 - movu [r8 + r3], xm3 - vextracti128 xm3, m4, 1 - movu [r8 + r3 * 2], xm4 - movu [r8 + r6], xm3 -%endif - lea r8, [r8 + r3 * 4] - - movq xm3, [r7 + r4] ; m3 = row 11 - punpcklbw xm6, xm3 - lea r7, [r7 + r1 * 4] - movq xm5, [r7] ; m5 = row 12 - punpcklbw xm3, xm5 - vinserti128 m6, m6, xm3, 1 - pmaddubsw m3, m6, m9 - paddw m0, m3 - pmaddubsw m6, m8 - movq xm3, [r7 + r1] ; m3 = row 13 - punpcklbw xm5, xm3 - movq xm2, [r7 + r1 * 2] ; m2 = row 14 - punpcklbw xm3, xm2 - vinserti128 m5, m5, xm3, 1 - pmaddubsw m3, m5, m9 - paddw m6, m3 - pmaddubsw m5, m8 - movq xm3, [r7 + r4] ; m3 = row 15 - punpcklbw xm2, xm3 - lea r7, [r7 + r1 * 4] - movq xm1, [r7] ; m1 = row 16 - punpcklbw xm3, xm1 - vinserti128 m2, m2, xm3, 1 - pmaddubsw m3, m2, m9 - paddw m5, m3 - pmaddubsw m2, m8 - movq xm3, [r7 + r1] ; m3 = row 17 - punpcklbw xm1, xm3 - movq xm4, [r7 + r1 * 2] ; m4 = row 18 - 
punpcklbw xm3, xm4 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m3, m1, m9 - paddw m2, m3 -%ifidn %1,pp - pmulhrsw m0, m7 ; m0 = word: row 8, row 9 - pmulhrsw m6, m7 ; m6 = word: row 10, row 11 - pmulhrsw m5, m7 ; m5 = word: row 12, row 13 - pmulhrsw m2, m7 ; m2 = word: row 14, row 15 - packuswb m0, m6 - packuswb m5, m2 - vextracti128 xm6, m0, 1 - vextracti128 xm2, m5, 1 - movq [r8], xm0 - movq [r8 + r3], xm6 - movhps [r8 + r3 * 2], xm0 - movhps [r8 + r6], xm6 - lea r8, [r8 + r3 * 4] - movq [r8], xm5 - movq [r8 + r3], xm2 - movhps [r8 + r3 * 2], xm5 - movhps [r8 + r6], xm2 - lea r2, [r8 + r3 * 4 - 16] -%else - psubw m0, m7 ; m0 = word: row 8, row 9 - psubw m6, m7 ; m6 = word: row 10, row 11 - psubw m5, m7 ; m5 = word: row 12, row 13 - psubw m2, m7 ; m2 = word: row 14, row 15 - vextracti128 xm3, m0, 1 - movu [r8], xm0 - movu [r8 + r3], xm3 - vextracti128 xm3, m6, 1 - movu [r8 + r3 * 2], xm6 - movu [r8 + r6], xm3 - vextracti128 xm3, m5, 1 - lea r8, [r8 + r3 * 4] - movu [r8], xm5 - movu [r8 + r3], xm3 - vextracti128 xm3, m2, 1 - movu [r8 + r3 * 2], xm2 - movu [r8 + r6], xm3 - lea r2, [r8 + r3 * 4 - 32] -%endif - lea r0, [r7 - 16] - dec r5d - jnz .loopH - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_24x32 pp - FILTER_VER_CHROMA_AVX2_24x32 ps - -%macro FILTER_VER_CHROMA_AVX2_24x64 1 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_24x64, 4, 7, 13 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - mova m10, [r5] - mova m11, [r5 + mmsize] - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m12, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m12, [pw_2000] -%endif - lea r5, [r3 * 3] - mov r6d, 16 -.loopH: - movu m0, [r0] ; m0 = row 0 - movu m1, [r0 + r1] ; m1 = row 1 - punpcklbw m2, m0, m1 - punpckhbw m3, m0, m1 - pmaddubsw m2, m10 - pmaddubsw m3, m10 - movu m0, [r0 + r1 * 2] ; m0 = row 2 - punpcklbw m4, m1, m0 - punpckhbw m5, m1, m0 - pmaddubsw m4, m10 - 
pmaddubsw m5, m10 - movu m1, [r0 + r4] ; m1 = row 3 - punpcklbw m6, m0, m1 - punpckhbw m7, m0, m1 - pmaddubsw m8, m6, m11 - pmaddubsw m9, m7, m11 - pmaddubsw m6, m10 - pmaddubsw m7, m10 - paddw m2, m8 - paddw m3, m9 -%ifidn %1,pp - pmulhrsw m2, m12 - pmulhrsw m3, m12 - packuswb m2, m3 - movu [r2], xm2 - vextracti128 xm2, m2, 1 - movq [r2 + 16], xm2 -%else - psubw m2, m12 - psubw m3, m12 - vperm2i128 m0, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - movu [r2], m0 - movu [r2 + mmsize], xm2 -%endif - lea r0, [r0 + r1 * 4] - movu m0, [r0] ; m0 = row 4 - punpcklbw m2, m1, m0 - punpckhbw m3, m1, m0 - pmaddubsw m8, m2, m11 - pmaddubsw m9, m3, m11 - pmaddubsw m2, m10 - pmaddubsw m3, m10 - paddw m4, m8 - paddw m5, m9 -%ifidn %1,pp - pmulhrsw m4, m12 - pmulhrsw m5, m12 - packuswb m4, m5 - movu [r2 + r3], xm4 - vextracti128 xm4, m4, 1 - movq [r2 + r3 + 16], xm4 -%else - psubw m4, m12 - psubw m5, m12 - vperm2i128 m1, m4, m5, 0x20 - vperm2i128 m4, m4, m5, 0x31 - movu [r2 + r3], m1 - movu [r2 + r3 + mmsize], xm4 -%endif - - movu m1, [r0 + r1] ; m1 = row 5 - punpcklbw m4, m0, m1 - punpckhbw m5, m0, m1 - pmaddubsw m4, m11 - pmaddubsw m5, m11 - paddw m6, m4 - paddw m7, m5 -%ifidn %1,pp - pmulhrsw m6, m12 - pmulhrsw m7, m12 - packuswb m6, m7 - movu [r2 + r3 * 2], xm6 - vextracti128 xm6, m6, 1 - movq [r2 + r3 * 2 + 16], xm6 -%else - psubw m6, m12 - psubw m7, m12 - vperm2i128 m0, m6, m7, 0x20 - vperm2i128 m6, m6, m7, 0x31 - movu [r2 + r3 * 2], m0 - movu [r2 + r3 * 2 + mmsize], xm6 -%endif - - movu m0, [r0 + r1 * 2] ; m0 = row 6 - punpcklbw m6, m1, m0 - punpckhbw m7, m1, m0 - pmaddubsw m6, m11 - pmaddubsw m7, m11 - paddw m2, m6 - paddw m3, m7 -%ifidn %1,pp - pmulhrsw m2, m12 - pmulhrsw m3, m12 - packuswb m2, m3 - movu [r2 + r5], xm2 - vextracti128 xm2, m2, 1 - movq [r2 + r5 + 16], xm2 -%else - psubw m2, m12 - psubw m3, m12 - vperm2i128 m0, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - movu [r2 + r5], m0 - movu [r2 + r5 + mmsize], xm2 -%endif - lea r2, [r2 + r3 * 4] - dec r6d - jnz .loopH 
- RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_24x64 pp - FILTER_VER_CHROMA_AVX2_24x64 ps - -%macro FILTER_VER_CHROMA_AVX2_16x4 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_16x4, 4, 6, 8 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m7, [pw_512] -%else - add r3d, r3d - mova m7, [pw_2000] -%endif - - movu xm0, [r0] - vinserti128 m0, m0, [r0 + r1 * 2], 1 - movu xm1, [r0 + r1] - vinserti128 m1, m1, [r0 + r4], 1 - - punpcklbw m2, m0, m1 - punpckhbw m3, m0, m1 - vperm2i128 m4, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - pmaddubsw m4, [r5] - pmaddubsw m3, m2, [r5 + mmsize] - paddw m4, m3 - pmaddubsw m2, [r5] - - vextracti128 xm0, m0, 1 - lea r0, [r0 + r1 * 4] - vinserti128 m0, m0, [r0], 1 - - punpcklbw m5, m1, m0 - punpckhbw m3, m1, m0 - vperm2i128 m6, m5, m3, 0x20 - vperm2i128 m5, m5, m3, 0x31 - pmaddubsw m6, [r5] - pmaddubsw m3, m5, [r5 + mmsize] - paddw m6, m3 - pmaddubsw m5, [r5] -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 0 - pmulhrsw m6, m7 ; m6 = word: row 1 - packuswb m4, m6 - vpermq m4, m4, 11011000b - vextracti128 xm6, m4, 1 - movu [r2], xm4 - movu [r2 + r3], xm6 -%else - psubw m4, m7 ; m4 = word: row 0 - psubw m6, m7 ; m6 = word: row 1 - movu [r2], m4 - movu [r2 + r3], m6 -%endif - lea r2, [r2 + r3 * 2] - - movu xm4, [r0 + r1 * 2] - vinserti128 m4, m4, [r0 + r1], 1 - vextracti128 xm1, m4, 1 - vinserti128 m0, m0, xm1, 0 - - punpcklbw m6, m0, m4 - punpckhbw m1, m0, m4 - vperm2i128 m0, m6, m1, 0x20 - vperm2i128 m6, m6, m1, 0x31 - pmaddubsw m0, [r5 + mmsize] - paddw m5, m0 - pmaddubsw m6, [r5 + mmsize] - paddw m2, m6 - -%ifidn %1,pp - pmulhrsw m2, m7 ; m2 = word: row 2 - pmulhrsw m5, m7 ; m5 = word: row 3 - packuswb m2, m5 - vpermq m2, m2, 11011000b - vextracti128 xm5, m2, 1 - movu [r2], xm2 - movu [r2 + r3], xm5 -%else - psubw m2, m7 ; m2 = word: row 2 - psubw m5, m7 ; m5 = word: row 3 - movu [r2], m2 
- movu [r2 + r3], m5 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_16x4 pp - FILTER_VER_CHROMA_AVX2_16x4 ps - -%macro FILTER_VER_CHROMA_AVX2_12xN 2 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_12x%2, 4, 7, 8 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m7, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m7, [pw_2000] -%endif - lea r6, [r3 * 3] -%rep %2 / 16 - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhbw xm2, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddubsw m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhbw xm3, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhbw xm4, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddubsw m4, m2, [r5 + 1 * mmsize] - paddw m0, m4 - pmaddubsw m2, [r5] - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhbw xm5, xm3, xm4 - punpcklbw xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddubsw m5, m3, [r5 + 1 * mmsize] - paddw m1, m5 - pmaddubsw m3, [r5] -%ifidn %1,pp - pmulhrsw m0, m7 ; m0 = word: row 0 - pmulhrsw m1, m7 ; m1 = word: row 1 - packuswb m0, m1 - vextracti128 xm1, m0, 1 - movq [r2], xm0 - movd [r2 + 8], xm1 - movhps [r2 + r3], xm0 - pextrd [r2 + r3 + 8], xm1, 2 -%else - psubw m0, m7 ; m0 = word: row 0 - psubw m1, m7 ; m1 = word: row 1 - movu [r2], xm0 - vextracti128 xm0, m0, 1 - movq [r2 + 16], xm0 - movu [r2 + r3], xm1 - vextracti128 xm1, m1, 1 - movq [r2 + r3 + 16], xm1 -%endif - - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhbw xm6, xm4, xm5 - punpcklbw xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddubsw m6, m4, [r5 + 1 * mmsize] - paddw m2, m6 - pmaddubsw m4, [r5] - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhbw xm0, xm5, xm6 - punpcklbw xm5, xm6 - vinserti128 m5, m5, xm0, 1 - pmaddubsw m0, m5, [r5 + 1 * mmsize] - 
paddw m3, m0 - pmaddubsw m5, [r5] -%ifidn %1,pp - pmulhrsw m2, m7 ; m2 = word: row 2 - pmulhrsw m3, m7 ; m3 = word: row 3 - packuswb m2, m3 - vextracti128 xm3, m2, 1 - movq [r2 + r3 * 2], xm2 - movd [r2 + r3 * 2 + 8], xm3 - movhps [r2 + r6], xm2 - pextrd [r2 + r6 + 8], xm3, 2 -%else - psubw m2, m7 ; m2 = word: row 2 - psubw m3, m7 ; m3 = word: row 3 - movu [r2 + r3 * 2], xm2 - vextracti128 xm2, m2, 1 - movq [r2 + r3 * 2 + 16], xm2 - movu [r2 + r6], xm3 - vextracti128 xm3, m3, 1 - movq [r2 + r6 + 16], xm3 -%endif - lea r2, [r2 + r3 * 4] - - movu xm0, [r0 + r4] ; m0 = row 7 - punpckhbw xm3, xm6, xm0 - punpcklbw xm6, xm0 - vinserti128 m6, m6, xm3, 1 - pmaddubsw m3, m6, [r5 + 1 * mmsize] - paddw m4, m3 - pmaddubsw m6, [r5] - lea r0, [r0 + r1 * 4] - movu xm3, [r0] ; m3 = row 8 - punpckhbw xm1, xm0, xm3 - punpcklbw xm0, xm3 - vinserti128 m0, m0, xm1, 1 - pmaddubsw m1, m0, [r5 + 1 * mmsize] - paddw m5, m1 - pmaddubsw m0, [r5] -%ifidn %1,pp - pmulhrsw m4, m7 ; m4 = word: row 4 - pmulhrsw m5, m7 ; m5 = word: row 5 - packuswb m4, m5 - vextracti128 xm5, m4, 1 - movq [r2], xm4 - movd [r2 + 8], xm5 - movhps [r2 + r3], xm4 - pextrd [r2 + r3 + 8], xm5, 2 -%else - psubw m4, m7 ; m4 = word: row 4 - psubw m5, m7 ; m5 = word: row 5 - movu [r2], xm4 - vextracti128 xm4, m4, 1 - movq [r2 + 16], xm4 - movu [r2 + r3], xm5 - vextracti128 xm5, m5, 1 - movq [r2 + r3 + 16], xm5 -%endif - - movu xm1, [r0 + r1] ; m1 = row 9 - punpckhbw xm2, xm3, xm1 - punpcklbw xm3, xm1 - vinserti128 m3, m3, xm2, 1 - pmaddubsw m2, m3, [r5 + 1 * mmsize] - paddw m6, m2 - pmaddubsw m3, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 10 - punpckhbw xm4, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm4, 1 - pmaddubsw m4, m1, [r5 + 1 * mmsize] - paddw m0, m4 - pmaddubsw m1, [r5] - -%ifidn %1,pp - pmulhrsw m6, m7 ; m6 = word: row 6 - pmulhrsw m0, m7 ; m0 = word: row 7 - packuswb m6, m0 - vextracti128 xm0, m6, 1 - movq [r2 + r3 * 2], xm6 - movd [r2 + r3 * 2 + 8], xm0 - movhps [r2 + r6], xm6 - pextrd [r2 + r6 + 8], 
xm0, 2 -%else - psubw m6, m7 ; m6 = word: row 6 - psubw m0, m7 ; m0 = word: row 7 - movu [r2 + r3 * 2], xm6 - vextracti128 xm6, m6, 1 - movq [r2 + r3 * 2 + 16], xm6 - movu [r2 + r6], xm0 - vextracti128 xm0, m0, 1 - movq [r2 + r6 + 16], xm0 -%endif - lea r2, [r2 + r3 * 4] - - movu xm4, [r0 + r4] ; m4 = row 11 - punpckhbw xm6, xm2, xm4 - punpcklbw xm2, xm4 - vinserti128 m2, m2, xm6, 1 - pmaddubsw m6, m2, [r5 + 1 * mmsize] - paddw m3, m6 - pmaddubsw m2, [r5] - lea r0, [r0 + r1 * 4] - movu xm6, [r0] ; m6 = row 12 - punpckhbw xm0, xm4, xm6 - punpcklbw xm4, xm6 - vinserti128 m4, m4, xm0, 1 - pmaddubsw m0, m4, [r5 + 1 * mmsize] - paddw m1, m0 - pmaddubsw m4, [r5] -%ifidn %1,pp - pmulhrsw m3, m7 ; m3 = word: row 8 - pmulhrsw m1, m7 ; m1 = word: row 9 - packuswb m3, m1 - vextracti128 xm1, m3, 1 - movq [r2], xm3 - movd [r2 + 8], xm1 - movhps [r2 + r3], xm3 - pextrd [r2 + r3 + 8], xm1, 2 -%else - psubw m3, m7 ; m3 = word: row 8 - psubw m1, m7 ; m1 = word: row 9 - movu [r2], xm3 - vextracti128 xm3, m3, 1 - movq [r2 + 16], xm3 - movu [r2 + r3], xm1 - vextracti128 xm1, m1, 1 - movq [r2 + r3 + 16], xm1 -%endif - - movu xm0, [r0 + r1] ; m0 = row 13 - punpckhbw xm1, xm6, xm0 - punpcklbw xm6, xm0 - vinserti128 m6, m6, xm1, 1 - pmaddubsw m1, m6, [r5 + 1 * mmsize] - paddw m2, m1 - pmaddubsw m6, [r5] - movu xm1, [r0 + r1 * 2] ; m1 = row 14 - punpckhbw xm5, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm5, 1 - pmaddubsw m5, m0, [r5 + 1 * mmsize] - paddw m4, m5 - pmaddubsw m0, [r5] -%ifidn %1,pp - pmulhrsw m2, m7 ; m2 = word: row 10 - pmulhrsw m4, m7 ; m4 = word: row 11 - packuswb m2, m4 - vextracti128 xm4, m2, 1 - movq [r2 + r3 * 2], xm2 - movd [r2 + r3 * 2 + 8], xm4 - movhps [r2 + r6], xm2 - pextrd [r2 + r6 + 8], xm4, 2 -%else - psubw m2, m7 ; m2 = word: row 10 - psubw m4, m7 ; m4 = word: row 11 - movu [r2 + r3 * 2], xm2 - vextracti128 xm2, m2, 1 - movq [r2 + r3 * 2 + 16], xm2 - movu [r2 + r6], xm4 - vextracti128 xm4, m4, 1 - movq [r2 + r6 + 16], xm4 -%endif - lea r2, [r2 + r3 * 
4] - - movu xm5, [r0 + r4] ; m5 = row 15 - punpckhbw xm2, xm1, xm5 - punpcklbw xm1, xm5 - vinserti128 m1, m1, xm2, 1 - pmaddubsw m2, m1, [r5 + 1 * mmsize] - paddw m6, m2 - pmaddubsw m1, [r5] - lea r0, [r0 + r1 * 4] - movu xm2, [r0] ; m2 = row 16 - punpckhbw xm3, xm5, xm2 - punpcklbw xm5, xm2 - vinserti128 m5, m5, xm3, 1 - pmaddubsw m3, m5, [r5 + 1 * mmsize] - paddw m0, m3 - pmaddubsw m5, [r5] - movu xm3, [r0 + r1] ; m3 = row 17 - punpckhbw xm4, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddubsw m2, [r5 + 1 * mmsize] - paddw m1, m2 - movu xm4, [r0 + r1 * 2] ; m4 = row 18 - punpckhbw xm2, xm3, xm4 - punpcklbw xm3, xm4 - vinserti128 m3, m3, xm2, 1 - pmaddubsw m3, [r5 + 1 * mmsize] - paddw m5, m3 - -%ifidn %1,pp - pmulhrsw m6, m7 ; m6 = word: row 12 - pmulhrsw m0, m7 ; m0 = word: row 13 - pmulhrsw m1, m7 ; m1 = word: row 14 - pmulhrsw m5, m7 ; m5 = word: row 15 - packuswb m6, m0 - packuswb m1, m5 - vextracti128 xm0, m6, 1 - vextracti128 xm5, m1, 1 - movq [r2], xm6 - movd [r2 + 8], xm0 - movhps [r2 + r3], xm6 - pextrd [r2 + r3 + 8], xm0, 2 - movq [r2 + r3 * 2], xm1 - movd [r2 + r3 * 2 + 8], xm5 - movhps [r2 + r6], xm1 - pextrd [r2 + r6 + 8], xm5, 2 -%else - psubw m6, m7 ; m6 = word: row 12 - psubw m0, m7 ; m0 = word: row 13 - psubw m1, m7 ; m1 = word: row 14 - psubw m5, m7 ; m5 = word: row 15 - movu [r2], xm6 - vextracti128 xm6, m6, 1 - movq [r2 + 16], xm6 - movu [r2 + r3], xm0 - vextracti128 xm0, m0, 1 - movq [r2 + r3 + 16], xm0 - movu [r2 + r3 * 2], xm1 - vextracti128 xm1, m1, 1 - movq [r2 + r3 * 2 + 16], xm1 - movu [r2 + r6], xm5 - vextracti128 xm5, m5, 1 - movq [r2 + r6 + 16], xm5 -%endif - lea r2, [r2 + r3 * 4] -%endrep - RET -%endmacro - - FILTER_VER_CHROMA_AVX2_12xN pp, 16 - FILTER_VER_CHROMA_AVX2_12xN ps, 16 - FILTER_VER_CHROMA_AVX2_12xN pp, 32 - FILTER_VER_CHROMA_AVX2_12xN ps, 32 - -;----------------------------------------------------------------------------- -;void interp_4tap_vert_pp_24x32(pixel *src, intptr_t srcStride, pixel *dst, 
intptr_t dstStride, int coeffIdx)
-;-----------------------------------------------------------------------------
-%macro FILTER_V4_W24 2
-INIT_XMM sse4
-cglobal interp_4tap_vert_pp_24x%2, 4, 6, 8
-
-    mov r4d, r4m
-    sub r0, r1
-
-%ifdef PIC
-    lea r5, [tab_ChromaCoeff]
-    movd m0, [r5 + r4 * 4]
-%else
-    movd m0, [tab_ChromaCoeff + r4 * 4]
-%endif
-
-    pshufb m1, m0, [tab_Vm]
-    pshufb m0, [tab_Vm + 16]
-
-    mov r4d, %2
-
-.loop:
-    movu m2, [r0]
-    movu m3, [r0 + r1]
-
-    punpcklbw m4, m2, m3
-    punpckhbw m2, m3
-
-    pmaddubsw m4, m1
-    pmaddubsw m2, m1
-
-    lea r5, [r0 + 2 * r1]
-    movu m5, [r5]
-    movu m7, [r5 + r1]
-
-    punpcklbw m6, m5, m7
-    pmaddubsw m6, m0
-    paddw m4, m6
-
-    punpckhbw m6, m5, m7
-    pmaddubsw m6, m0
-    paddw m2, m6
-
-    mova m6, [pw_512]
-
-    pmulhrsw m4, m6
-    pmulhrsw m2, m6
-
-    packuswb m4, m2
-
-    movu [r2], m4
-
-    punpcklbw m4, m3, m5
-    punpckhbw m3, m5
-
-    pmaddubsw m4, m1
-    pmaddubsw m3, m1
-
-    movu m2, [r5 + 2 * r1]
-
-    punpcklbw m5, m7, m2
-    punpckhbw m7, m2
-
-    pmaddubsw m5, m0
-    pmaddubsw m7, m0
-
-    paddw m4, m5
-    paddw m3, m7
-
-    pmulhrsw m4, m6
-    pmulhrsw m3, m6
-
-    packuswb m4, m3
-
-    movu [r2 + r3], m4
-
-    movq m2, [r0 + 16]
-    movq m3, [r0 + r1 + 16]
-    movq m4, [r5 + 16]
-    movq m5, [r5 + r1 + 16]
-
-    punpcklbw m2, m3
-    punpcklbw m4, m5
-
-    pmaddubsw m2, m1
-    pmaddubsw m4, m0
-
-    paddw m2, m4
-
-    pmulhrsw m2, m6
-
-    movq m3, [r0 + r1 + 16]
-    movq m4, [r5 + 16]
-    movq m5, [r5 + r1 + 16]
-    movq m7, [r5 + 2 * r1 + 16]
-
-    punpcklbw m3, m4
-    punpcklbw m5, m7
-
-    pmaddubsw m3, m1
-    pmaddubsw m5, m0
-
-    paddw m3, m5
-
-    pmulhrsw m3, m6
-    packuswb m2, m3
-
-    movh [r2 + 16], m2
-    movhps [r2 + r3 + 16], m2
-
-    mov r0, r5
-    lea r2, [r2 + 2 * r3]
-
-    sub r4, 2
-    jnz .loop
-    RET
-%endmacro
-
-    FILTER_V4_W24 24, 32
-
-    FILTER_V4_W24 24, 64
-
-;-----------------------------------------------------------------------------
-; void interp_4tap_vert_pp_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
-;----------------------------------------------------------------------------- -%macro FILTER_V4_W32 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - - mova m7, [pw_512] - - mov r4d, %2 - -.loop: - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r5, [r0 + 2 * r1] - movu m3, [r5] - movu m5, [r5 + r1] - - punpcklbw m6, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m6, m0 - pmaddubsw m3, m0 - - paddw m4, m6 - paddw m2, m3 - - pmulhrsw m4, m7 - pmulhrsw m2, m7 - - packuswb m4, m2 - - movu [r2], m4 - - movu m2, [r0 + 16] - movu m3, [r0 + r1 + 16] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - movu m3, [r5 + 16] - movu m5, [r5 + r1 + 16] - - punpcklbw m6, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m6, m0 - pmaddubsw m3, m0 - - paddw m4, m6 - paddw m2, m3 - - pmulhrsw m4, m7 - pmulhrsw m2, m7 - - packuswb m4, m2 - - movu [r2 + 16], m4 - - lea r0, [r0 + r1] - lea r2, [r2 + r3] - - dec r4 - jnz .loop - RET -%endmacro - - FILTER_V4_W32 32, 8 - FILTER_V4_W32 32, 16 - FILTER_V4_W32 32, 24 - FILTER_V4_W32 32, 32 - - FILTER_V4_W32 32, 48 - FILTER_V4_W32 32, 64 - -%macro FILTER_VER_CHROMA_AVX2_32xN 2 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_32x%2, 4, 7, 13 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - mova m10, [r5] - mova m11, [r5 + mmsize] - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m12, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m12, [pw_2000] -%endif - lea r5, [r3 * 3] - mov r6d, %2 / 4 -.loopW: - movu m0, [r0] ; m0 = row 0 - movu m1, [r0 + r1] ; m1 = row 1 - punpcklbw m2, m0, m1 - punpckhbw m3, m0, m1 - pmaddubsw 
m2, m10 - pmaddubsw m3, m10 - movu m0, [r0 + r1 * 2] ; m0 = row 2 - punpcklbw m4, m1, m0 - punpckhbw m5, m1, m0 - pmaddubsw m4, m10 - pmaddubsw m5, m10 - movu m1, [r0 + r4] ; m1 = row 3 - punpcklbw m6, m0, m1 - punpckhbw m7, m0, m1 - pmaddubsw m8, m6, m11 - pmaddubsw m9, m7, m11 - pmaddubsw m6, m10 - pmaddubsw m7, m10 - paddw m2, m8 - paddw m3, m9 -%ifidn %1,pp - pmulhrsw m2, m12 - pmulhrsw m3, m12 - packuswb m2, m3 - movu [r2], m2 -%else - psubw m2, m12 - psubw m3, m12 - vperm2i128 m0, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - movu [r2], m0 - movu [r2 + mmsize], m2 -%endif - lea r0, [r0 + r1 * 4] - movu m0, [r0] ; m0 = row 4 - punpcklbw m2, m1, m0 - punpckhbw m3, m1, m0 - pmaddubsw m8, m2, m11 - pmaddubsw m9, m3, m11 - pmaddubsw m2, m10 - pmaddubsw m3, m10 - paddw m4, m8 - paddw m5, m9 -%ifidn %1,pp - pmulhrsw m4, m12 - pmulhrsw m5, m12 - packuswb m4, m5 - movu [r2 + r3], m4 -%else - psubw m4, m12 - psubw m5, m12 - vperm2i128 m1, m4, m5, 0x20 - vperm2i128 m4, m4, m5, 0x31 - movu [r2 + r3], m1 - movu [r2 + r3 + mmsize], m4 -%endif - - movu m1, [r0 + r1] ; m1 = row 5 - punpcklbw m4, m0, m1 - punpckhbw m5, m0, m1 - pmaddubsw m4, m11 - pmaddubsw m5, m11 - paddw m6, m4 - paddw m7, m5 -%ifidn %1,pp - pmulhrsw m6, m12 - pmulhrsw m7, m12 - packuswb m6, m7 - movu [r2 + r3 * 2], m6 -%else - psubw m6, m12 - psubw m7, m12 - vperm2i128 m0, m6, m7, 0x20 - vperm2i128 m6, m6, m7, 0x31 - movu [r2 + r3 * 2], m0 - movu [r2 + r3 * 2 + mmsize], m6 -%endif - - movu m0, [r0 + r1 * 2] ; m0 = row 6 - punpcklbw m6, m1, m0 - punpckhbw m7, m1, m0 - pmaddubsw m6, m11 - pmaddubsw m7, m11 - paddw m2, m6 - paddw m3, m7 -%ifidn %1,pp - pmulhrsw m2, m12 - pmulhrsw m3, m12 - packuswb m2, m3 - movu [r2 + r5], m2 -%else - psubw m2, m12 - psubw m3, m12 - vperm2i128 m0, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - movu [r2 + r5], m0 - movu [r2 + r5 + mmsize], m2 -%endif - lea r2, [r2 + r3 * 4] - dec r6d - jnz .loopW - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_32xN pp, 64 - 
FILTER_VER_CHROMA_AVX2_32xN pp, 48 - FILTER_VER_CHROMA_AVX2_32xN pp, 32 - FILTER_VER_CHROMA_AVX2_32xN pp, 24 - FILTER_VER_CHROMA_AVX2_32xN pp, 16 - FILTER_VER_CHROMA_AVX2_32xN pp, 8 - FILTER_VER_CHROMA_AVX2_32xN ps, 64 - FILTER_VER_CHROMA_AVX2_32xN ps, 48 - FILTER_VER_CHROMA_AVX2_32xN ps, 32 - FILTER_VER_CHROMA_AVX2_32xN ps, 24 - FILTER_VER_CHROMA_AVX2_32xN ps, 16 - FILTER_VER_CHROMA_AVX2_32xN ps, 8 - -%macro FILTER_VER_CHROMA_AVX2_48x64 1 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_48x64, 4, 8, 13 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - mova m10, [r5] - mova m11, [r5 + mmsize] - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m12, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m12, [pw_2000] -%endif - lea r5, [r3 * 3] - lea r7, [r1 * 4] - mov r6d, 16 -.loopH: - movu m0, [r0] ; m0 = row 0 - movu m1, [r0 + r1] ; m1 = row 1 - punpcklbw m2, m0, m1 - punpckhbw m3, m0, m1 - pmaddubsw m2, m10 - pmaddubsw m3, m10 - movu m0, [r0 + r1 * 2] ; m0 = row 2 - punpcklbw m4, m1, m0 - punpckhbw m5, m1, m0 - pmaddubsw m4, m10 - pmaddubsw m5, m10 - movu m1, [r0 + r4] ; m1 = row 3 - punpcklbw m6, m0, m1 - punpckhbw m7, m0, m1 - pmaddubsw m8, m6, m11 - pmaddubsw m9, m7, m11 - pmaddubsw m6, m10 - pmaddubsw m7, m10 - paddw m2, m8 - paddw m3, m9 -%ifidn %1,pp - pmulhrsw m2, m12 - pmulhrsw m3, m12 - packuswb m2, m3 - movu [r2], m2 -%else - psubw m2, m12 - psubw m3, m12 - vperm2i128 m0, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - movu [r2], m0 - movu [r2 + mmsize], m2 -%endif - lea r0, [r0 + r1 * 4] - movu m0, [r0] ; m0 = row 4 - punpcklbw m2, m1, m0 - punpckhbw m3, m1, m0 - pmaddubsw m8, m2, m11 - pmaddubsw m9, m3, m11 - pmaddubsw m2, m10 - pmaddubsw m3, m10 - paddw m4, m8 - paddw m5, m9 -%ifidn %1,pp - pmulhrsw m4, m12 - pmulhrsw m5, m12 - packuswb m4, m5 - movu [r2 + r3], m4 -%else - psubw m4, m12 - psubw m5, m12 - vperm2i128 m1, m4, m5, 0x20 - vperm2i128 
m4, m4, m5, 0x31 - movu [r2 + r3], m1 - movu [r2 + r3 + mmsize], m4 -%endif - - movu m1, [r0 + r1] ; m1 = row 5 - punpcklbw m4, m0, m1 - punpckhbw m5, m0, m1 - pmaddubsw m4, m11 - pmaddubsw m5, m11 - paddw m6, m4 - paddw m7, m5 -%ifidn %1,pp - pmulhrsw m6, m12 - pmulhrsw m7, m12 - packuswb m6, m7 - movu [r2 + r3 * 2], m6 -%else - psubw m6, m12 - psubw m7, m12 - vperm2i128 m0, m6, m7, 0x20 - vperm2i128 m6, m6, m7, 0x31 - movu [r2 + r3 * 2], m0 - movu [r2 + r3 * 2 + mmsize], m6 -%endif - - movu m0, [r0 + r1 * 2] ; m0 = row 6 - punpcklbw m6, m1, m0 - punpckhbw m7, m1, m0 - pmaddubsw m6, m11 - pmaddubsw m7, m11 - paddw m2, m6 - paddw m3, m7 -%ifidn %1,pp - pmulhrsw m2, m12 - pmulhrsw m3, m12 - packuswb m2, m3 - movu [r2 + r5], m2 - add r2, 32 -%else - psubw m2, m12 - psubw m3, m12 - vperm2i128 m0, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - movu [r2 + r5], m0 - movu [r2 + r5 + mmsize], m2 - add r2, 64 -%endif - sub r0, r7 - - movu xm0, [r0 + 32] ; m0 = row 0 - movu xm1, [r0 + r1 + 32] ; m1 = row 1 - punpckhbw xm2, xm0, xm1 - punpcklbw xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddubsw m0, m10 - movu xm2, [r0 + r1 * 2 + 32] ; m2 = row 2 - punpckhbw xm3, xm1, xm2 - punpcklbw xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddubsw m1, m10 - movu xm3, [r0 + r4 + 32] ; m3 = row 3 - punpckhbw xm4, xm2, xm3 - punpcklbw xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddubsw m4, m2, m11 - paddw m0, m4 - pmaddubsw m2, m10 - lea r0, [r0 + r1 * 4] - movu xm4, [r0 + 32] ; m4 = row 4 - punpckhbw xm5, xm3, xm4 - punpcklbw xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddubsw m5, m3, m11 - paddw m1, m5 - pmaddubsw m3, m10 - movu xm5, [r0 + r1 + 32] ; m5 = row 5 - punpckhbw xm6, xm4, xm5 - punpcklbw xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddubsw m4, m11 - paddw m2, m4 - movu xm6, [r0 + r1 * 2 + 32] ; m6 = row 6 - punpckhbw xm7, xm5, xm6 - punpcklbw xm5, xm6 - vinserti128 m5, m5, xm7, 1 - pmaddubsw m5, m11 - paddw m3, m5 -%ifidn %1,pp - pmulhrsw m0, m12 ; m0 = word: row 0 - pmulhrsw m1, m12 ; m1 = 
word: row 1 - pmulhrsw m2, m12 ; m2 = word: row 2 - pmulhrsw m3, m12 ; m3 = word: row 3 - packuswb m0, m1 - packuswb m2, m3 - vpermq m0, m0, 11011000b - vpermq m2, m2, 11011000b - vextracti128 xm1, m0, 1 - vextracti128 xm3, m2, 1 - movu [r2], xm0 - movu [r2 + r3], xm1 - movu [r2 + r3 * 2], xm2 - movu [r2 + r5], xm3 - lea r2, [r2 + r3 * 4 - 32] -%else - psubw m0, m12 ; m0 = word: row 0 - psubw m1, m12 ; m1 = word: row 1 - psubw m2, m12 ; m2 = word: row 2 - psubw m3, m12 ; m3 = word: row 3 - movu [r2], m0 - movu [r2 + r3], m1 - movu [r2 + r3 * 2], m2 - movu [r2 + r5], m3 - lea r2, [r2 + r3 * 4 - 64] -%endif - dec r6d - jnz .loopH - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_48x64 pp - FILTER_VER_CHROMA_AVX2_48x64 ps - -%macro FILTER_VER_CHROMA_AVX2_64xN 2 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_64x%2, 4, 8, 13 - mov r4d, r4m - shl r4d, 6 - -%ifdef PIC - lea r5, [tab_ChromaCoeffVer_32] - add r5, r4 -%else - lea r5, [tab_ChromaCoeffVer_32 + r4] -%endif - - mova m10, [r5] - mova m11, [r5 + mmsize] - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,pp - mova m12, [pw_512] -%else - add r3d, r3d - vbroadcasti128 m12, [pw_2000] -%endif - lea r5, [r3 * 3] - lea r7, [r1 * 4] - mov r6d, %2 / 4 -.loopH: -%assign x 0 -%rep 2 - movu m0, [r0 + x] ; m0 = row 0 - movu m1, [r0 + r1 + x] ; m1 = row 1 - punpcklbw m2, m0, m1 - punpckhbw m3, m0, m1 - pmaddubsw m2, m10 - pmaddubsw m3, m10 - movu m0, [r0 + r1 * 2 + x] ; m0 = row 2 - punpcklbw m4, m1, m0 - punpckhbw m5, m1, m0 - pmaddubsw m4, m10 - pmaddubsw m5, m10 - movu m1, [r0 + r4 + x] ; m1 = row 3 - punpcklbw m6, m0, m1 - punpckhbw m7, m0, m1 - pmaddubsw m8, m6, m11 - pmaddubsw m9, m7, m11 - pmaddubsw m6, m10 - pmaddubsw m7, m10 - paddw m2, m8 - paddw m3, m9 -%ifidn %1,pp - pmulhrsw m2, m12 - pmulhrsw m3, m12 - packuswb m2, m3 - movu [r2], m2 -%else - psubw m2, m12 - psubw m3, m12 - vperm2i128 m0, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - movu [r2], m0 - movu [r2 + mmsize], m2 -%endif - lea r0, [r0 + r1 * 
4] - movu m0, [r0 + x] ; m0 = row 4 - punpcklbw m2, m1, m0 - punpckhbw m3, m1, m0 - pmaddubsw m8, m2, m11 - pmaddubsw m9, m3, m11 - pmaddubsw m2, m10 - pmaddubsw m3, m10 - paddw m4, m8 - paddw m5, m9 -%ifidn %1,pp - pmulhrsw m4, m12 - pmulhrsw m5, m12 - packuswb m4, m5 - movu [r2 + r3], m4 -%else - psubw m4, m12 - psubw m5, m12 - vperm2i128 m1, m4, m5, 0x20 - vperm2i128 m4, m4, m5, 0x31 - movu [r2 + r3], m1 - movu [r2 + r3 + mmsize], m4 -%endif - - movu m1, [r0 + r1 + x] ; m1 = row 5 - punpcklbw m4, m0, m1 - punpckhbw m5, m0, m1 - pmaddubsw m4, m11 - pmaddubsw m5, m11 - paddw m6, m4 - paddw m7, m5 -%ifidn %1,pp - pmulhrsw m6, m12 - pmulhrsw m7, m12 - packuswb m6, m7 - movu [r2 + r3 * 2], m6 -%else - psubw m6, m12 - psubw m7, m12 - vperm2i128 m0, m6, m7, 0x20 - vperm2i128 m6, m6, m7, 0x31 - movu [r2 + r3 * 2], m0 - movu [r2 + r3 * 2 + mmsize], m6 -%endif - - movu m0, [r0 + r1 * 2 + x] ; m0 = row 6 - punpcklbw m6, m1, m0 - punpckhbw m7, m1, m0 - pmaddubsw m6, m11 - pmaddubsw m7, m11 - paddw m2, m6 - paddw m3, m7 -%ifidn %1,pp - pmulhrsw m2, m12 - pmulhrsw m3, m12 - packuswb m2, m3 - movu [r2 + r5], m2 - add r2, 32 -%else - psubw m2, m12 - psubw m3, m12 - vperm2i128 m0, m2, m3, 0x20 - vperm2i128 m2, m2, m3, 0x31 - movu [r2 + r5], m0 - movu [r2 + r5 + mmsize], m2 - add r2, 64 -%endif - sub r0, r7 -%assign x x+32 -%endrep -%ifidn %1,pp - lea r2, [r2 + r3 * 4 - 64] -%else - lea r2, [r2 + r3 * 4 - 128] -%endif - add r0, r7 - dec r6d - jnz .loopH - RET -%endif -%endmacro - - FILTER_VER_CHROMA_AVX2_64xN pp, 64 - FILTER_VER_CHROMA_AVX2_64xN pp, 48 - FILTER_VER_CHROMA_AVX2_64xN pp, 32 - FILTER_VER_CHROMA_AVX2_64xN pp, 16 - FILTER_VER_CHROMA_AVX2_64xN ps, 64 - FILTER_VER_CHROMA_AVX2_64xN ps, 48 - FILTER_VER_CHROMA_AVX2_64xN ps, 32 - FILTER_VER_CHROMA_AVX2_64xN ps, 16 - -;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
-;----------------------------------------------------------------------------- -%macro FILTER_V4_W16n_H2 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8 - - mov r4d, r4m - sub r0, r1 - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - - mov r4d, %2/2 - -.loop: - - mov r6d, %1/16 - -.loopW: - - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r5, [r0 + 2 * r1] - movu m5, [r5] - movu m6, [r5 + r1] - - punpckhbw m7, m5, m6 - pmaddubsw m7, m0 - paddw m2, m7 - - punpcklbw m7, m5, m6 - pmaddubsw m7, m0 - paddw m4, m7 - - mova m7, [pw_512] - - pmulhrsw m4, m7 - pmulhrsw m2, m7 - - packuswb m4, m2 - - movu [r2], m4 - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m4, m1 - pmaddubsw m3, m1 - - movu m5, [r5 + 2 * r1] - - punpcklbw m2, m6, m5 - punpckhbw m6, m5 - - pmaddubsw m2, m0 - pmaddubsw m6, m0 - - paddw m4, m2 - paddw m3, m6 - - pmulhrsw m4, m7 - pmulhrsw m3, m7 - - packuswb m4, m3 - - movu [r2 + r3], m4 - - add r0, 16 - add r2, 16 - dec r6d - jnz .loopW - - lea r0, [r0 + r1 * 2 - %1] - lea r2, [r2 + r3 * 2 - %1] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V4_W16n_H2 64, 64 - FILTER_V4_W16n_H2 64, 32 - FILTER_V4_W16n_H2 64, 48 - FILTER_V4_W16n_H2 48, 64 - FILTER_V4_W16n_H2 64, 16 - -%macro PROCESS_CHROMA_SP_W4_4R 0 - movq m0, [r0] - movq m1, [r0 + r1] - punpcklwd m0, m1 ;m0=[0 1] - pmaddwd m0, [r6 + 0 *16] ;m0=[0+1] Row1 - - lea r0, [r0 + 2 * r1] - movq m4, [r0] - punpcklwd m1, m4 ;m1=[1 2] - pmaddwd m1, [r6 + 0 *16] ;m1=[1+2] Row2 - - movq m5, [r0 + r1] - punpcklwd m4, m5 ;m4=[2 3] - pmaddwd m2, m4, [r6 + 0 *16] ;m2=[2+3] Row3 - pmaddwd m4, [r6 + 1 * 16] - paddd m0, m4 ;m0=[0+1+2+3] Row1 done - - lea r0, [r0 + 2 * r1] - movq m4, [r0] - punpcklwd m5, m4 ;m5=[3 4] - pmaddwd m3, m5, [r6 + 0 *16] ;m3=[3+4] Row4 - pmaddwd m5, [r6 + 1 * 16] - 
paddd m1, m5 ;m1 = [1+2+3+4] Row2 - - movq m5, [r0 + r1] - punpcklwd m4, m5 ;m4=[4 5] - pmaddwd m4, [r6 + 1 * 16] - paddd m2, m4 ;m2=[2+3+4+5] Row3 - - movq m4, [r0 + 2 * r1] - punpcklwd m5, m4 ;m5=[5 6] - pmaddwd m5, [r6 + 1 * 16] - paddd m3, m5 ;m3=[3+4+5+6] Row4 -%endmacro - -;-------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_sp_%1x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_SP 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_sp_%1x%2, 5, 7, 7 ,0-gprsize - - add r1d, r1d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r6, [r5 + r4] -%else - lea r6, [tab_ChromaCoeffV + r4] -%endif - - mova m6, [v4_pd_526336] - - mov dword [rsp], %2/4 - -.loopH: - mov r4d, (%1/4) -.loopW: - PROCESS_CHROMA_SP_W4_4R - - paddd m0, m6 - paddd m1, m6 - paddd m2, m6 - paddd m3, m6 - - psrad m0, 12 - psrad m1, 12 - psrad m2, 12 - psrad m3, 12 - - packssdw m0, m1 - packssdw m2, m3 - - packuswb m0, m2 - - movd [r2], m0 - pextrd [r2 + r3], m0, 1 - lea r5, [r2 + 2 * r3] - pextrd [r5], m0, 2 - pextrd [r5 + r3], m0, 3 - - lea r5, [4 * r1 - 2 * 4] - sub r0, r5 - add r2, 4 - - dec r4d - jnz .loopW - - lea r0, [r0 + 4 * r1 - 2 * %1] - lea r2, [r2 + 4 * r3 - %1] - - dec dword [rsp] - jnz .loopH - - RET -%endmacro - - FILTER_VER_CHROMA_SP 4, 4 - FILTER_VER_CHROMA_SP 4, 8 - FILTER_VER_CHROMA_SP 16, 16 - FILTER_VER_CHROMA_SP 16, 8 - FILTER_VER_CHROMA_SP 16, 12 - FILTER_VER_CHROMA_SP 12, 16 - FILTER_VER_CHROMA_SP 16, 4 - FILTER_VER_CHROMA_SP 4, 16 - FILTER_VER_CHROMA_SP 32, 32 - FILTER_VER_CHROMA_SP 32, 16 - FILTER_VER_CHROMA_SP 16, 32 - FILTER_VER_CHROMA_SP 32, 24 - FILTER_VER_CHROMA_SP 24, 32 - FILTER_VER_CHROMA_SP 32, 8 - - FILTER_VER_CHROMA_SP 16, 24 - FILTER_VER_CHROMA_SP 16, 64 - FILTER_VER_CHROMA_SP 12, 32 - 
FILTER_VER_CHROMA_SP 4, 32 - FILTER_VER_CHROMA_SP 32, 64 - FILTER_VER_CHROMA_SP 32, 48 - FILTER_VER_CHROMA_SP 24, 64 - - FILTER_VER_CHROMA_SP 64, 64 - FILTER_VER_CHROMA_SP 64, 32 - FILTER_VER_CHROMA_SP 64, 48 - FILTER_VER_CHROMA_SP 48, 64 - FILTER_VER_CHROMA_SP 64, 16 - - -%macro PROCESS_CHROMA_SP_W2_4R 1 - movd m0, [r0] - movd m1, [r0 + r1] - punpcklwd m0, m1 ;m0=[0 1] - - lea r0, [r0 + 2 * r1] - movd m2, [r0] - punpcklwd m1, m2 ;m1=[1 2] - punpcklqdq m0, m1 ;m0=[0 1 1 2] - pmaddwd m0, [%1 + 0 *16] ;m0=[0+1 1+2] Row 1-2 - - movd m1, [r0 + r1] - punpcklwd m2, m1 ;m2=[2 3] - - lea r0, [r0 + 2 * r1] - movd m3, [r0] - punpcklwd m1, m3 ;m2=[3 4] - punpcklqdq m2, m1 ;m2=[2 3 3 4] - - pmaddwd m4, m2, [%1 + 1 * 16] ;m4=[2+3 3+4] Row 1-2 - pmaddwd m2, [%1 + 0 * 16] ;m2=[2+3 3+4] Row 3-4 - paddd m0, m4 ;m0=[0+1+2+3 1+2+3+4] Row 1-2 - - movd m1, [r0 + r1] - punpcklwd m3, m1 ;m3=[4 5] - - movd m4, [r0 + 2 * r1] - punpcklwd m1, m4 ;m1=[5 6] - punpcklqdq m3, m1 ;m2=[4 5 5 6] - pmaddwd m3, [%1 + 1 * 16] ;m3=[4+5 5+6] Row 3-4 - paddd m2, m3 ;m2=[2+3+4+5 3+4+5+6] Row 3-4 -%endmacro - -;------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vertical_sp_%1x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_SP_W2_4R 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_sp_%1x%2, 5, 6, 6 - - add r1d, r1d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - mova m5, [v4_pd_526336] - - mov r4d, (%2/4) - -.loopH: - PROCESS_CHROMA_SP_W2_4R r5 - - paddd m0, m5 - paddd m2, m5 - - psrad m0, 12 - psrad m2, 12 - - packssdw m0, m2 - packuswb m0, m0 - - pextrw [r2], m0, 0 - pextrw [r2 + r3], m0, 1 - lea r2, [r2 + 2 * r3] - pextrw [r2], m0, 2 - pextrw [r2 + r3], m0, 3 
- - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_VER_CHROMA_SP_W2_4R 2, 4 - FILTER_VER_CHROMA_SP_W2_4R 2, 8 - - FILTER_VER_CHROMA_SP_W2_4R 2, 16 - -;-------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_sp_4x2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;-------------------------------------------------------------------------------------------------------------- -INIT_XMM sse4 -cglobal interp_4tap_vert_sp_4x2, 5, 6, 5 - - add r1d, r1d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - mova m4, [v4_pd_526336] - - movq m0, [r0] - movq m1, [r0 + r1] - punpcklwd m0, m1 ;m0=[0 1] - pmaddwd m0, [r5 + 0 *16] ;m0=[0+1] Row1 - - lea r0, [r0 + 2 * r1] - movq m2, [r0] - punpcklwd m1, m2 ;m1=[1 2] - pmaddwd m1, [r5 + 0 *16] ;m1=[1+2] Row2 - - movq m3, [r0 + r1] - punpcklwd m2, m3 ;m4=[2 3] - pmaddwd m2, [r5 + 1 * 16] - paddd m0, m2 ;m0=[0+1+2+3] Row1 done - paddd m0, m4 - psrad m0, 12 - - movq m2, [r0 + 2 * r1] - punpcklwd m3, m2 ;m5=[3 4] - pmaddwd m3, [r5 + 1 * 16] - paddd m1, m3 ;m1 = [1+2+3+4] Row2 done - paddd m1, m4 - psrad m1, 12 - - packssdw m0, m1 - packuswb m0, m0 - - movd [r2], m0 - pextrd [r2 + r3], m0, 1 - - RET - -;------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vertical_sp_6x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_SP_W6_H4 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_sp_6x%2, 5, 7, 7 - - add r1d, r1d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r6, [r5 + r4] -%else - lea r6, [tab_ChromaCoeffV + r4] -%endif - - mova m6, 
[v4_pd_526336] - - mov r4d, %2/4 - -.loopH: - PROCESS_CHROMA_SP_W4_4R - - paddd m0, m6 - paddd m1, m6 - paddd m2, m6 - paddd m3, m6 - - psrad m0, 12 - psrad m1, 12 - psrad m2, 12 - psrad m3, 12 - - packssdw m0, m1 - packssdw m2, m3 - - packuswb m0, m2 - - movd [r2], m0 - pextrd [r2 + r3], m0, 1 - lea r5, [r2 + 2 * r3] - pextrd [r5], m0, 2 - pextrd [r5 + r3], m0, 3 - - lea r5, [4 * r1 - 2 * 4] - sub r0, r5 - add r2, 4 - - PROCESS_CHROMA_SP_W2_4R r6 - - paddd m0, m6 - paddd m2, m6 - - psrad m0, 12 - psrad m2, 12 - - packssdw m0, m2 - packuswb m0, m0 - - pextrw [r2], m0, 0 - pextrw [r2 + r3], m0, 1 - lea r2, [r2 + 2 * r3] - pextrw [r2], m0, 2 - pextrw [r2 + r3], m0, 3 - - sub r0, 2 * 4 - lea r2, [r2 + 2 * r3 - 4] - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_VER_CHROMA_SP_W6_H4 6, 8 - - FILTER_VER_CHROMA_SP_W6_H4 6, 16 - -%macro PROCESS_CHROMA_SP_W8_2R 0 - movu m1, [r0] - movu m3, [r0 + r1] - punpcklwd m0, m1, m3 - pmaddwd m0, [r5 + 0 * 16] ;m0 = [0l+1l] Row1l - punpckhwd m1, m3 - pmaddwd m1, [r5 + 0 * 16] ;m1 = [0h+1h] Row1h - - movu m4, [r0 + 2 * r1] - punpcklwd m2, m3, m4 - pmaddwd m2, [r5 + 0 * 16] ;m2 = [1l+2l] Row2l - punpckhwd m3, m4 - pmaddwd m3, [r5 + 0 * 16] ;m3 = [1h+2h] Row2h - - lea r0, [r0 + 2 * r1] - movu m5, [r0 + r1] - punpcklwd m6, m4, m5 - pmaddwd m6, [r5 + 1 * 16] ;m6 = [2l+3l] Row1l - paddd m0, m6 ;m0 = [0l+1l+2l+3l] Row1l sum - punpckhwd m4, m5 - pmaddwd m4, [r5 + 1 * 16] ;m6 = [2h+3h] Row1h - paddd m1, m4 ;m1 = [0h+1h+2h+3h] Row1h sum - - movu m4, [r0 + 2 * r1] - punpcklwd m6, m5, m4 - pmaddwd m6, [r5 + 1 * 16] ;m6 = [3l+4l] Row2l - paddd m2, m6 ;m2 = [1l+2l+3l+4l] Row2l sum - punpckhwd m5, m4 - pmaddwd m5, [r5 + 1 * 16] ;m1 = [3h+4h] Row2h - paddd m3, m5 ;m3 = [1h+2h+3h+4h] Row2h sum -%endmacro - -;-------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_sp_8x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
-;-------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_SP_W8_H2 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_sp_%1x%2, 5, 6, 8 - - add r1d, r1d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - mova m7, [v4_pd_526336] - - mov r4d, %2/2 -.loopH: - PROCESS_CHROMA_SP_W8_2R - - paddd m0, m7 - paddd m1, m7 - paddd m2, m7 - paddd m3, m7 - - psrad m0, 12 - psrad m1, 12 - psrad m2, 12 - psrad m3, 12 - - packssdw m0, m1 - packssdw m2, m3 - - packuswb m0, m2 - - movlps [r2], m0 - movhps [r2 + r3], m0 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_VER_CHROMA_SP_W8_H2 8, 2 - FILTER_VER_CHROMA_SP_W8_H2 8, 4 - FILTER_VER_CHROMA_SP_W8_H2 8, 6 - FILTER_VER_CHROMA_SP_W8_H2 8, 8 - FILTER_VER_CHROMA_SP_W8_H2 8, 16 - FILTER_VER_CHROMA_SP_W8_H2 8, 32 - - FILTER_VER_CHROMA_SP_W8_H2 8, 12 - FILTER_VER_CHROMA_SP_W8_H2 8, 64 - - -;--------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------- -%macro FILTER_V_PS_W16n 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_%1x%2, 4, 7, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m1, m0, [tab_Vm] - pshufb m0, [tab_Vm + 16] - mov r4d, %2/2 - -.loop: - - mov r6d, %1/16 - -.loopW: - - movu m2, [r0] - movu m3, [r0 + r1] - - punpcklbw m4, m2, m3 - punpckhbw m2, m3 - - pmaddubsw m4, m1 - pmaddubsw m2, m1 - - lea r5, [r0 + 2 * r1] - movu m5, [r5] - movu m7, [r5 + r1] - - punpcklbw m6, m5, m7 - pmaddubsw m6, m0 - paddw m4, m6 - - punpckhbw m6, m5, m7 - 
pmaddubsw m6, m0 - paddw m2, m6 - - mova m6, [pw_2000] - - psubw m4, m6 - psubw m2, m6 - - movu [r2], m4 - movu [r2 + 16], m2 - - punpcklbw m4, m3, m5 - punpckhbw m3, m5 - - pmaddubsw m4, m1 - pmaddubsw m3, m1 - - movu m5, [r5 + 2 * r1] - - punpcklbw m2, m7, m5 - punpckhbw m7, m5 - - pmaddubsw m2, m0 - pmaddubsw m7, m0 - - paddw m4, m2 - paddw m3, m7 - - psubw m4, m6 - psubw m3, m6 - - movu [r2 + r3], m4 - movu [r2 + r3 + 16], m3 - - add r0, 16 - add r2, 32 - dec r6d - jnz .loopW - - lea r0, [r0 + r1 * 2 - %1] - lea r2, [r2 + r3 * 2 - %1 * 2] - - dec r4d - jnz .loop - RET -%endmacro - - FILTER_V_PS_W16n 64, 64 - FILTER_V_PS_W16n 64, 32 - FILTER_V_PS_W16n 64, 48 - FILTER_V_PS_W16n 48, 64 - FILTER_V_PS_W16n 64, 16 - - -;------------------------------------------------------------------------------------------------------------ -;void interp_4tap_vert_ps_2x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------ -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_2x4, 4, 6, 7 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - - lea r5, [3 * r1] - - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r5] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklbw m2, m6 - - pmaddubsw m2, m0 - - lea r0, [r0 + 4 * r1] - movd m6, [r0] - - punpcklbw m3, m4 - punpcklbw m1, m5, m6 - punpcklbw m3, m1 - - pmaddubsw m3, m0 - phaddw m2, m3 - - mova m1, [pw_2000] - - psubw m2, m1 - - movd [r2], m2 - pextrd [r2 + r3], m2, 2 - - movd m2, [r0 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m2 - punpcklbw m4, m3 - - pmaddubsw m4, m0 - - movd m3, [r0 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m2, m3 - punpcklbw m5, m2 - - pmaddubsw m5, m0 - phaddw m4, m5 - psubw m4, m1 - - lea r2, [r2 + 2 * r3] - movd 
[r2], m4 - pextrd [r2 + r3], m4, 2 - - RET - -;------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ps_2x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;------------------------------------------------------------------------------------------------------------- -%macro FILTER_V_PS_W2 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ps_2x%2, 4, 6, 8 - - mov r4d, r4m - sub r0, r1 - add r3d, r3d - -%ifdef PIC - lea r5, [tab_ChromaCoeff] - movd m0, [r5 + r4 * 4] -%else - movd m0, [tab_ChromaCoeff + r4 * 4] -%endif - - pshufb m0, [tab_Cm] - - mova m1, [pw_2000] - lea r5, [3 * r1] - mov r4d, %2/4 -.loop: - movd m2, [r0] - movd m3, [r0 + r1] - movd m4, [r0 + 2 * r1] - movd m5, [r0 + r5] - - punpcklbw m2, m3 - punpcklbw m6, m4, m5 - punpcklbw m2, m6 - - pmaddubsw m2, m0 - - lea r0, [r0 + 4 * r1] - movd m6, [r0] - - punpcklbw m3, m4 - punpcklbw m7, m5, m6 - punpcklbw m3, m7 - - pmaddubsw m3, m0 - - phaddw m2, m3 - psubw m2, m1 - - - movd [r2], m2 - pshufd m2, m2, 2 - movd [r2 + r3], m2 - - movd m2, [r0 + r1] - - punpcklbw m4, m5 - punpcklbw m3, m6, m2 - punpcklbw m4, m3 - - pmaddubsw m4, m0 - - movd m3, [r0 + 2 * r1] - - punpcklbw m5, m6 - punpcklbw m2, m3 - punpcklbw m5, m2 - - pmaddubsw m5, m0 - - phaddw m4, m5 - - psubw m4, m1 - - lea r2, [r2 + 2 * r3] - movd [r2], m4 - pshufd m4 , m4 ,2 - movd [r2 + r3], m4 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loop - - RET -%endmacro - - FILTER_V_PS_W2 2, 8 - - FILTER_V_PS_W2 2, 16 - -;----------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_SS 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_ss_%1x%2, 5, 7, 6 
,0-gprsize - - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r6, [r5 + r4] -%else - lea r6, [tab_ChromaCoeffV + r4] -%endif - - mov dword [rsp], %2/4 - -.loopH: - mov r4d, (%1/4) -.loopW: - PROCESS_CHROMA_SP_W4_4R - - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 - - movlps [r2], m0 - movhps [r2 + r3], m0 - lea r5, [r2 + 2 * r3] - movlps [r5], m2 - movhps [r5 + r3], m2 - - lea r5, [4 * r1 - 2 * 4] - sub r0, r5 - add r2, 2 * 4 - - dec r4d - jnz .loopW - - lea r0, [r0 + 4 * r1 - 2 * %1] - lea r2, [r2 + 4 * r3 - 2 * %1] - - dec dword [rsp] - jnz .loopH - - RET -%endmacro - - FILTER_VER_CHROMA_SS 4, 4 - FILTER_VER_CHROMA_SS 4, 8 - FILTER_VER_CHROMA_SS 16, 16 - FILTER_VER_CHROMA_SS 16, 8 - FILTER_VER_CHROMA_SS 16, 12 - FILTER_VER_CHROMA_SS 12, 16 - FILTER_VER_CHROMA_SS 16, 4 - FILTER_VER_CHROMA_SS 4, 16 - FILTER_VER_CHROMA_SS 32, 32 - FILTER_VER_CHROMA_SS 32, 16 - FILTER_VER_CHROMA_SS 16, 32 - FILTER_VER_CHROMA_SS 32, 24 - FILTER_VER_CHROMA_SS 24, 32 - FILTER_VER_CHROMA_SS 32, 8 - - FILTER_VER_CHROMA_SS 16, 24 - FILTER_VER_CHROMA_SS 12, 32 - FILTER_VER_CHROMA_SS 4, 32 - FILTER_VER_CHROMA_SS 32, 64 - FILTER_VER_CHROMA_SS 16, 64 - FILTER_VER_CHROMA_SS 32, 48 - FILTER_VER_CHROMA_SS 24, 64 - - FILTER_VER_CHROMA_SS 64, 64 - FILTER_VER_CHROMA_SS 64, 32 - FILTER_VER_CHROMA_SS 64, 48 - FILTER_VER_CHROMA_SS 48, 64 - FILTER_VER_CHROMA_SS 64, 16 - -%macro FILTER_VER_CHROMA_S_AVX2_4x4 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x4, 4, 6, 7 - mov r4d, r4m - add r1d, r1d - shl r4d, 6 - sub r0, r1 - -%ifdef PIC - lea r5, [pw_ChromaCoeffV] - add r5, r4 -%else - lea r5, [pw_ChromaCoeffV + r4] -%endif - - lea r4, [r1 * 3] -%ifidn %1,sp - mova m6, [v4_pd_526336] -%else - add r3d, r3d -%endif - - movq xm0, [r0] - movq xm1, [r0 + r1] - punpcklwd xm0, xm1 - movq xm2, [r0 + r1 * 2] - punpcklwd xm1, xm2 - vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] - pmaddwd m0, [r5] - movq xm3, [r0 + r4] 
- punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movq xm4, [r0] - punpcklwd xm3, xm4 - vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] - pmaddwd m5, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m5 - movq xm3, [r0 + r1] - punpcklwd xm4, xm3 - movq xm1, [r0 + r1 * 2] - punpcklwd xm3, xm1 - vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] - pmaddwd m4, [r5 + 1 * mmsize] - paddd m2, m4 - -%ifidn %1,sp - paddd m0, m6 - paddd m2, m6 - psrad m0, 12 - psrad m2, 12 -%else - psrad m0, 6 - psrad m2, 6 -%endif - packssdw m0, m2 - vextracti128 xm2, m0, 1 - lea r4, [r3 * 3] - -%ifidn %1,sp - packuswb xm0, xm2 - movd [r2], xm0 - pextrd [r2 + r3], xm0, 2 - pextrd [r2 + r3 * 2], xm0, 1 - pextrd [r2 + r4], xm0, 3 -%else - movq [r2], xm0 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r4], xm2 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_S_AVX2_4x4 sp - FILTER_VER_CHROMA_S_AVX2_4x4 ss - -%macro FILTER_VER_CHROMA_S_AVX2_4x8 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_4x8, 4, 6, 8 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - sub r0, r1 - -%ifdef PIC - lea r5, [pw_ChromaCoeffV] - add r5, r4 -%else - lea r5, [pw_ChromaCoeffV + r4] -%endif - - lea r4, [r1 * 3] -%ifidn %1,sp - mova m7, [v4_pd_526336] -%else - add r3d, r3d -%endif - - movq xm0, [r0] - movq xm1, [r0 + r1] - punpcklwd xm0, xm1 - movq xm2, [r0 + r1 * 2] - punpcklwd xm1, xm2 - vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] - pmaddwd m0, [r5] - movq xm3, [r0 + r4] - punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movq xm4, [r0] - punpcklwd xm3, xm4 - vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] - pmaddwd m5, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m5 - movq xm3, [r0 + r1] - punpcklwd xm4, xm3 - movq xm1, [r0 + r1 * 2] - punpcklwd xm3, xm1 - vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] - pmaddwd m5, m4, [r5 + 1 * mmsize] - paddd m2, m5 - pmaddwd m4, [r5] - movq xm3, [r0 + r4] - punpcklwd xm1, xm3 - lea r0, [r0 + 4 * r1] - movq xm6, [r0] - punpcklwd xm3, xm6 - vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] - 
pmaddwd m5, m1, [r5 + 1 * mmsize] - paddd m4, m5 - pmaddwd m1, [r5] - movq xm3, [r0 + r1] - punpcklwd xm6, xm3 - movq xm5, [r0 + 2 * r1] - punpcklwd xm3, xm5 - vinserti128 m6, m6, xm3, 1 ; m6 = [A 9 9 8] - pmaddwd m6, [r5 + 1 * mmsize] - paddd m1, m6 - lea r4, [r3 * 3] - -%ifidn %1,sp - paddd m0, m7 - paddd m2, m7 - paddd m4, m7 - paddd m1, m7 - psrad m0, 12 - psrad m2, 12 - psrad m4, 12 - psrad m1, 12 -%else - psrad m0, 6 - psrad m2, 6 - psrad m4, 6 - psrad m1, 6 -%endif - packssdw m0, m2 - packssdw m4, m1 -%ifidn %1,sp - packuswb m0, m4 - vextracti128 xm2, m0, 1 - movd [r2], xm0 - movd [r2 + r3], xm2 - pextrd [r2 + r3 * 2], xm0, 1 - pextrd [r2 + r4], xm2, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm0, 2 - pextrd [r2 + r3], xm2, 2 - pextrd [r2 + r3 * 2], xm0, 3 - pextrd [r2 + r4], xm2, 3 -%else - vextracti128 xm2, m0, 1 - vextracti128 xm1, m4, 1 - movq [r2], xm0 - movq [r2 + r3], xm2 - movhps [r2 + r3 * 2], xm0 - movhps [r2 + r4], xm2 - lea r2, [r2 + r3 * 4] - movq [r2], xm4 - movq [r2 + r3], xm1 - movhps [r2 + r3 * 2], xm4 - movhps [r2 + r4], xm1 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_S_AVX2_4x8 sp - FILTER_VER_CHROMA_S_AVX2_4x8 ss - -%macro PROCESS_CHROMA_AVX2_W4_16R 1 - movq xm0, [r0] - movq xm1, [r0 + r1] - punpcklwd xm0, xm1 - movq xm2, [r0 + r1 * 2] - punpcklwd xm1, xm2 - vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] - pmaddwd m0, [r5] - movq xm3, [r0 + r4] - punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movq xm4, [r0] - punpcklwd xm3, xm4 - vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] - pmaddwd m5, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m5 - movq xm3, [r0 + r1] - punpcklwd xm4, xm3 - movq xm1, [r0 + r1 * 2] - punpcklwd xm3, xm1 - vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] - pmaddwd m5, m4, [r5 + 1 * mmsize] - paddd m2, m5 - pmaddwd m4, [r5] - movq xm3, [r0 + r4] - punpcklwd xm1, xm3 - lea r0, [r0 + 4 * r1] - movq xm6, [r0] - punpcklwd xm3, xm6 - vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] - pmaddwd m5, m1, [r5 + 1 * mmsize] - paddd m4, m5 
-    pmaddwd m1, [r5]
-    movq xm3, [r0 + r1]
-    punpcklwd xm6, xm3
-    movq xm5, [r0 + 2 * r1]
-    punpcklwd xm3, xm5
-    vinserti128 m6, m6, xm3, 1 ; m6 = [10 9 9 8]
-    pmaddwd m3, m6, [r5 + 1 * mmsize]
-    paddd m1, m3
-    pmaddwd m6, [r5]
-
-%ifidn %1,sp
-    paddd m0, m7
-    paddd m2, m7
-    paddd m4, m7
-    paddd m1, m7
-    psrad m4, 12
-    psrad m1, 12
-    psrad m0, 12
-    psrad m2, 12
-%else
-    psrad m0, 6
-    psrad m2, 6
-    psrad m4, 6
-    psrad m1, 6
-%endif
-    packssdw m0, m2
-    packssdw m4, m1
-%ifidn %1,sp
-    packuswb m0, m4
-    vextracti128 xm4, m0, 1
-    movd [r2], xm0
-    movd [r2 + r3], xm4
-    pextrd [r2 + r3 * 2], xm0, 1
-    pextrd [r2 + r6], xm4, 1
-    lea r2, [r2 + r3 * 4]
-    pextrd [r2], xm0, 2
-    pextrd [r2 + r3], xm4, 2
-    pextrd [r2 + r3 * 2], xm0, 3
-    pextrd [r2 + r6], xm4, 3
-%else
-    vextracti128 xm2, m0, 1
-    vextracti128 xm1, m4, 1
-    movq [r2], xm0
-    movq [r2 + r3], xm2
-    movhps [r2 + r3 * 2], xm0
-    movhps [r2 + r6], xm2
-    lea r2, [r2 + r3 * 4]
-    movq [r2], xm4
-    movq [r2 + r3], xm1
-    movhps [r2 + r3 * 2], xm4
-    movhps [r2 + r6], xm1
-%endif
-
-    movq xm2, [r0 + r4]
-    punpcklwd xm5, xm2
-    lea r0, [r0 + 4 * r1]
-    movq xm0, [r0]
-    punpcklwd xm2, xm0
-    vinserti128 m5, m5, xm2, 1 ; m5 = [12 11 11 10]
-    pmaddwd m2, m5, [r5 + 1 * mmsize]
-    paddd m6, m2
-    pmaddwd m5, [r5]
-    movq xm2, [r0 + r1]
-    punpcklwd xm0, xm2
-    movq xm3, [r0 + 2 * r1]
-    punpcklwd xm2, xm3
-    vinserti128 m0, m0, xm2, 1 ; m0 = [14 13 13 12]
-    pmaddwd m2, m0, [r5 + 1 * mmsize]
-    paddd m5, m2
-    pmaddwd m0, [r5]
-    movq xm4, [r0 + r4]
-    punpcklwd xm3, xm4
-    lea r0, [r0 + 4 * r1]
-    movq xm1, [r0]
-    punpcklwd xm4, xm1
-    vinserti128 m3, m3, xm4, 1 ; m3 = [16 15 15 14]
-    pmaddwd m4, m3, [r5 + 1 * mmsize]
-    paddd m0, m4
-    pmaddwd m3, [r5]
-    movq xm4, [r0 + r1]
-    punpcklwd xm1, xm4
-    movq xm2, [r0 + 2 * r1]
-    punpcklwd xm4, xm2
-    vinserti128 m1, m1, xm4, 1 ; m1 = [18 17 17 16]
-    pmaddwd m1, [r5 + 1 * mmsize]
-    paddd m3, m1
-
-%ifidn %1,sp
-    paddd m6, m7
-    paddd m5, m7
-    paddd m0, m7
-    paddd m3, m7
-    psrad m6, 12
-    psrad m5, 12
-    psrad m0, 12
-    psrad m3, 12
-%else
-    psrad m6, 6
-    psrad m5, 6
-    psrad m0, 6
-    psrad m3, 6
-%endif
-    packssdw m6, m5
-    packssdw m0, m3
-    lea r2, [r2 + r3 * 4]
-
-%ifidn %1,sp
-    packuswb m6, m0
-    vextracti128 xm0, m6, 1
-    movd [r2], xm6
-    movd [r2 + r3], xm0
-    pextrd [r2 + r3 * 2], xm6, 1
-    pextrd [r2 + r6], xm0, 1
-    lea r2, [r2 + r3 * 4]
-    pextrd [r2], xm6, 2
-    pextrd [r2 + r3], xm0, 2
-    pextrd [r2 + r3 * 2], xm6, 3
-    pextrd [r2 + r6], xm0, 3
-%else
-    vextracti128 xm5, m6, 1
-    vextracti128 xm3, m0, 1
-    movq [r2], xm6
-    movq [r2 + r3], xm5
-    movhps [r2 + r3 * 2], xm6
-    movhps [r2 + r6], xm5
-    lea r2, [r2 + r3 * 4]
-    movq [r2], xm0
-    movq [r2 + r3], xm3
-    movhps [r2 + r3 * 2], xm0
-    movhps [r2 + r6], xm3
-%endif
-%endmacro
-
-%macro FILTER_VER_CHROMA_S_AVX2_4x16 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_4x16, 4, 7, 8
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-    sub r0, r1
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-%ifidn %1,sp
-    mova m7, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-    PROCESS_CHROMA_AVX2_W4_16R %1
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_4x16 sp
-    FILTER_VER_CHROMA_S_AVX2_4x16 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_4x32 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_4x32, 4, 7, 8
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-    sub r0, r1
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-%ifidn %1,sp
-    mova m7, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-%rep 2
-    PROCESS_CHROMA_AVX2_W4_16R %1
-    lea r2, [r2 + r3 * 4]
-%endrep
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_4x32 sp
-    FILTER_VER_CHROMA_S_AVX2_4x32 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_4x2 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_4x2, 4, 6, 6
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-    sub r0, r1
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-%ifidn %1,sp
-    mova m5, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    movq xm0, [r0]
-    movq xm1, [r0 + r1]
-    punpcklwd xm0, xm1
-    movq xm2, [r0 + r1 * 2]
-    punpcklwd xm1, xm2
-    vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0]
-    pmaddwd m0, [r5]
-    movq xm3, [r0 + r4]
-    punpcklwd xm2, xm3
-    movq xm4, [r0 + 4 * r1]
-    punpcklwd xm3, xm4
-    vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2]
-    pmaddwd m2, [r5 + 1 * mmsize]
-    paddd m0, m2
-%ifidn %1,sp
-    paddd m0, m5
-    psrad m0, 12
-%else
-    psrad m0, 6
-%endif
-    vextracti128 xm1, m0, 1
-    packssdw xm0, xm1
-%ifidn %1,sp
-    packuswb xm0, xm0
-    movd [r2], xm0
-    pextrd [r2 + r3], xm0, 1
-%else
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-%endif
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_4x2 sp
-    FILTER_VER_CHROMA_S_AVX2_4x2 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_2x4 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_2x4, 4, 6, 6
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-    sub r0, r1
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-%ifidn %1,sp
-    mova m5, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    movd xm0, [r0]
-    movd xm1, [r0 + r1]
-    punpcklwd xm0, xm1
-    movd xm2, [r0 + r1 * 2]
-    punpcklwd xm1, xm2
-    punpcklqdq xm0, xm1 ; m0 = [2 1 1 0]
-    movd xm3, [r0 + r4]
-    punpcklwd xm2, xm3
-    lea r0, [r0 + 4 * r1]
-    movd xm4, [r0]
-    punpcklwd xm3, xm4
-    punpcklqdq xm2, xm3 ; m2 = [4 3 3 2]
-    vinserti128 m0, m0, xm2, 1 ; m0 = [4 3 3 2 2 1 1 0]
-    movd xm1, [r0 + r1]
-    punpcklwd xm4, xm1
-    movd xm3, [r0 + r1 * 2]
-    punpcklwd xm1, xm3
-    punpcklqdq xm4, xm1 ; m4 = [6 5 5 4]
-    vinserti128 m2, m2, xm4, 1 ; m2 = [6 5 5 4 4 3 3 2]
-    pmaddwd m0, [r5]
-    pmaddwd m2, [r5 + 1 * mmsize]
-    paddd m0, m2
-%ifidn %1,sp
-    paddd m0, m5
-    psrad m0, 12
-%else
-    psrad m0, 6
-%endif
-    vextracti128 xm1, m0, 1
-    packssdw xm0, xm1
-    lea r4, [r3 * 3]
-%ifidn %1,sp
-    packuswb xm0, xm0
-    pextrw [r2], xm0, 0
-    pextrw [r2 + r3], xm0, 1
-    pextrw [r2 + 2 * r3], xm0, 2
-    pextrw [r2 + r4], xm0, 3
-%else
-    movd [r2], xm0
-    pextrd [r2 + r3], xm0, 1
-    pextrd [r2 + 2 * r3], xm0, 2
-    pextrd [r2 + r4], xm0, 3
-%endif
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_2x4 sp
-    FILTER_VER_CHROMA_S_AVX2_2x4 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_8x8 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_8x8, 4, 6, 8
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m7, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-
-    movu xm0, [r0] ; m0 = row 0
-    movu xm1, [r0 + r1] ; m1 = row 1
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m0, [r5]
-    movu xm2, [r0 + r1 * 2] ; m2 = row 2
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5]
-    movu xm3, [r0 + r4] ; m3 = row 3
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m4, m2, [r5 + 1 * mmsize]
-    pmaddwd m2, [r5]
-    paddd m0, m4
-    lea r0, [r0 + r1 * 4]
-    movu xm4, [r0] ; m4 = row 4
-    punpckhwd xm5, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm5, 1
-    pmaddwd m5, m3, [r5 + 1 * mmsize]
-    pmaddwd m3, [r5]
-    paddd m1, m5
-%ifidn %1,sp
-    paddd m0, m7
-    paddd m1, m7
-    psrad m0, 12
-    psrad m1, 12
-%else
-    psrad m0, 6
-    psrad m1, 6
-%endif
-    packssdw m0, m1
-
-    movu xm5, [r0 + r1] ; m5 = row 5
-    punpckhwd xm6, xm4, xm5
-    punpcklwd xm4, xm5
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m6, m4, [r5 + 1 * mmsize]
-    paddd m2, m6
-    pmaddwd m4, [r5]
-    movu xm6, [r0 + r1 * 2] ; m6 = row 6
-    punpckhwd xm1, xm5, xm6
-    punpcklwd xm5, xm6
-    vinserti128 m5, m5, xm1, 1
-    pmaddwd m1, m5, [r5 + 1 * mmsize]
-    pmaddwd m5, [r5]
-    paddd m3, m1
-%ifidn %1,sp
-    paddd m2, m7
-    paddd m3, m7
-    psrad m2, 12
-    psrad m3, 12
-%else
-    psrad m2, 6
-    psrad m3, 6
-%endif
-    packssdw m2, m3
-
-    movu xm1, [r0 + r4] ; m1 = row 7
-    punpckhwd xm3, xm6, xm1
-    punpcklwd xm6, xm1
-    vinserti128 m6, m6, xm3, 1
-    pmaddwd m3, m6, [r5 + 1 * mmsize]
-    pmaddwd m6, [r5]
-    paddd m4, m3
-
-    lea r4, [r3 * 3]
-%ifidn %1,sp
-    packuswb m0, m2
-    mova m3, [v4_interp8_hps_shuf]
-    vpermd m0, m3, m0
-    vextracti128 xm2, m0, 1
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-    movq [r2 + r3 * 2], xm2
-    movhps [r2 + r4], xm2
-%else
-    vpermq m0, m0, 11011000b
-    vpermq m2, m2, 11011000b
-    movu [r2], xm0
-    vextracti128 xm0, m0, 1
-    vextracti128 xm3, m2, 1
-    movu [r2 + r3], xm0
-    movu [r2 + r3 * 2], xm2
-    movu [r2 + r4], xm3
-%endif
-    lea r2, [r2 + r3 * 4]
-    lea r0, [r0 + r1 * 4]
-    movu xm0, [r0] ; m0 = row 8
-    punpckhwd xm2, xm1, xm0
-    punpcklwd xm1, xm0
-    vinserti128 m1, m1, xm2, 1
-    pmaddwd m2, m1, [r5 + 1 * mmsize]
-    pmaddwd m1, [r5]
-    paddd m5, m2
-%ifidn %1,sp
-    paddd m4, m7
-    paddd m5, m7
-    psrad m4, 12
-    psrad m5, 12
-%else
-    psrad m4, 6
-    psrad m5, 6
-%endif
-    packssdw m4, m5
-
-    movu xm2, [r0 + r1] ; m2 = row 9
-    punpckhwd xm5, xm0, xm2
-    punpcklwd xm0, xm2
-    vinserti128 m0, m0, xm5, 1
-    pmaddwd m0, [r5 + 1 * mmsize]
-    paddd m6, m0
-    movu xm5, [r0 + r1 * 2] ; m5 = row 10
-    punpckhwd xm0, xm2, xm5
-    punpcklwd xm2, xm5
-    vinserti128 m2, m2, xm0, 1
-    pmaddwd m2, [r5 + 1 * mmsize]
-    paddd m1, m2
-
-%ifidn %1,sp
-    paddd m6, m7
-    paddd m1, m7
-    psrad m6, 12
-    psrad m1, 12
-%else
-    psrad m6, 6
-    psrad m1, 6
-%endif
-    packssdw m6, m1
-%ifidn %1,sp
-    packuswb m4, m6
-    vpermd m4, m3, m4
-    vextracti128 xm6, m4, 1
-    movq [r2], xm4
-    movhps [r2 + r3], xm4
-    movq [r2 + r3 * 2], xm6
-    movhps [r2 + r4], xm6
-%else
-    vpermq m4, m4, 11011000b
-    vpermq m6, m6, 11011000b
-    vextracti128 xm5, m4, 1
-    vextracti128 xm1, m6, 1
-    movu [r2], xm4
-    movu [r2 + r3], xm5
-    movu [r2 + r3 * 2], xm6
-    movu [r2 + r4], xm1
-%endif
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_8x8 sp
-    FILTER_VER_CHROMA_S_AVX2_8x8 ss
-
-%macro PROCESS_CHROMA_S_AVX2_W8_16R 1
-    movu xm0, [r0] ; m0 = row 0
-    movu xm1, [r0 + r1] ; m1 = row 1
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m0, [r5]
-    movu xm2, [r0 + r1 * 2] ; m2 = row 2
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5]
-    movu xm3, [r0 + r4] ; m3 = row 3
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m4, m2, [r5 + 1 * mmsize]
-    paddd m0, m4
-    pmaddwd m2, [r5]
-    lea r7, [r0 + r1 * 4]
-    movu xm4, [r7] ; m4 = row 4
-    punpckhwd xm5, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm5, 1
-    pmaddwd m5, m3, [r5 + 1 * mmsize]
-    paddd m1, m5
-    pmaddwd m3, [r5]
-    movu xm5, [r7 + r1] ; m5 = row 5
-    punpckhwd xm6, xm4, xm5
-    punpcklwd xm4, xm5
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m6, m4, [r5 + 1 * mmsize]
-    paddd m2, m6
-    pmaddwd m4, [r5]
-    movu xm6, [r7 + r1 * 2] ; m6 = row 6
-    punpckhwd xm7, xm5, xm6
-    punpcklwd xm5, xm6
-    vinserti128 m5, m5, xm7, 1
-    pmaddwd m7, m5, [r5 + 1 * mmsize]
-    paddd m3, m7
-    pmaddwd m5, [r5]
-%ifidn %1,sp
-    paddd m0, m9
-    paddd m1, m9
-    paddd m2, m9
-    paddd m3, m9
-    psrad m0, 12
-    psrad m1, 12
-    psrad m2, 12
-    psrad m3, 12
-%else
-    psrad m0, 6
-    psrad m1, 6
-    psrad m2, 6
-    psrad m3, 6
-%endif
-    packssdw m0, m1
-    packssdw m2, m3
-%ifidn %1,sp
-    packuswb m0, m2
-    mova m3, [v4_interp8_hps_shuf]
-    vpermd m0, m3, m0
-    vextracti128 xm2, m0, 1
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-    movq [r2 + r3 * 2], xm2
-    movhps [r2 + r6], xm2
-%else
-    vpermq m0, m0, 11011000b
-    vpermq m2, m2, 11011000b
-    vextracti128 xm1, m0, 1
-    vextracti128 xm3, m2, 1
-    movu [r2], xm0
-    movu [r2 + r3], xm1
-    movu [r2 + r3 * 2], xm2
-    movu [r2 + r6], xm3
-%endif
-
-    movu xm7, [r7 + r4] ; m7 = row 7
-    punpckhwd xm8, xm6, xm7
-    punpcklwd xm6, xm7
-    vinserti128 m6, m6, xm8, 1
-    pmaddwd m8, m6, [r5 + 1 * mmsize]
-    paddd m4, m8
-    pmaddwd m6, [r5]
-    lea r7, [r7 + r1 * 4]
-    movu xm8, [r7] ; m8 = row 8
-    punpckhwd xm0, xm7, xm8
-    punpcklwd xm7, xm8
-    vinserti128 m7, m7, xm0, 1
-    pmaddwd m0, m7, [r5 + 1 * mmsize]
-    paddd m5, m0
-    pmaddwd m7, [r5]
-    movu xm0, [r7 + r1] ; m0 = row 9
-    punpckhwd xm1, xm8, xm0
-    punpcklwd xm8, xm0
-    vinserti128 m8, m8, xm1, 1
-    pmaddwd m1, m8, [r5 + 1 * mmsize]
-    paddd m6, m1
-    pmaddwd m8, [r5]
-    movu xm1, [r7 + r1 * 2] ; m1 = row 10
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m2, m0, [r5 + 1 * mmsize]
-    paddd m7, m2
-    pmaddwd m0, [r5]
-%ifidn %1,sp
-    paddd m4, m9
-    paddd m5, m9
-    psrad m4, 12
-    psrad m5, 12
-    paddd m6, m9
-    paddd m7, m9
-    psrad m6, 12
-    psrad m7, 12
-%else
-    psrad m4, 6
-    psrad m5, 6
-    psrad m6, 6
-    psrad m7, 6
-%endif
-    packssdw m4, m5
-    packssdw m6, m7
-    lea r8, [r2 + r3 * 4]
-%ifidn %1,sp
-    packuswb m4, m6
-    vpermd m4, m3, m4
-    vextracti128 xm6, m4, 1
-    movq [r8], xm4
-    movhps [r8 + r3], xm4
-    movq [r8 + r3 * 2], xm6
-    movhps [r8 + r6], xm6
-%else
-    vpermq m4, m4, 11011000b
-    vpermq m6, m6, 11011000b
-    vextracti128 xm5, m4, 1
-    vextracti128 xm7, m6, 1
-    movu [r8], xm4
-    movu [r8 + r3], xm5
-    movu [r8 + r3 * 2], xm6
-    movu [r8 + r6], xm7
-%endif
-
-    movu xm2, [r7 + r4] ; m2 = row 11
-    punpckhwd xm4, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm4, 1
-    pmaddwd m4, m1, [r5 + 1 * mmsize]
-    paddd m8, m4
-    pmaddwd m1, [r5]
-    lea r7, [r7 + r1 * 4]
-    movu xm4, [r7] ; m4 = row 12
-    punpckhwd xm5, xm2, xm4
-    punpcklwd xm2, xm4
-    vinserti128 m2, m2, xm5, 1
-    pmaddwd m5, m2, [r5 + 1 * mmsize]
-    paddd m0, m5
-    pmaddwd m2, [r5]
-    movu xm5, [r7 + r1] ; m5 = row 13
-    punpckhwd xm6, xm4, xm5
-    punpcklwd xm4, xm5
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m6, m4, [r5 + 1 * mmsize]
-    paddd m1, m6
-    pmaddwd m4, [r5]
-    movu xm6, [r7 + r1 * 2] ; m6 = row 14
-    punpckhwd xm7, xm5, xm6
-    punpcklwd xm5, xm6
-    vinserti128 m5, m5, xm7, 1
-    pmaddwd m7, m5, [r5 + 1 * mmsize]
-    paddd m2, m7
-    pmaddwd m5, [r5]
-%ifidn %1,sp
-    paddd m8, m9
-    paddd m0, m9
-    paddd m1, m9
-    paddd m2, m9
-    psrad m8, 12
-    psrad m0, 12
-    psrad m1, 12
-    psrad m2, 12
-%else
-    psrad m8, 6
-    psrad m0, 6
-    psrad m1, 6
-    psrad m2, 6
-%endif
-    packssdw m8, m0
-    packssdw m1, m2
-    lea r8, [r8 + r3 * 4]
-%ifidn %1,sp
-    packuswb m8, m1
-    vpermd m8, m3, m8
-    vextracti128 xm1, m8, 1
-    movq [r8], xm8
-    movhps [r8 + r3], xm8
-    movq [r8 + r3 * 2], xm1
-    movhps [r8 + r6], xm1
-%else
-    vpermq m8, m8, 11011000b
-    vpermq m1, m1, 11011000b
-    vextracti128 xm0, m8, 1
-    vextracti128 xm2, m1, 1
-    movu [r8], xm8
-    movu [r8 + r3], xm0
-    movu [r8 + r3 * 2], xm1
-    movu [r8 + r6], xm2
-%endif
-    lea r8, [r8 + r3 * 4]
-
-    movu xm7, [r7 + r4] ; m7 = row 15
-    punpckhwd xm2, xm6, xm7
-    punpcklwd xm6, xm7
-    vinserti128 m6, m6, xm2, 1
-    pmaddwd m2, m6, [r5 + 1 * mmsize]
-    paddd m4, m2
-    pmaddwd m6, [r5]
-    lea r7, [r7 + r1 * 4]
-    movu xm2, [r7] ; m2 = row 16
-    punpckhwd xm1, xm7, xm2
-    punpcklwd xm7, xm2
-    vinserti128 m7, m7, xm1, 1
-    pmaddwd m1, m7, [r5 + 1 * mmsize]
-    paddd m5, m1
-    pmaddwd m7, [r5]
-    movu xm1, [r7 + r1] ; m1 = row 17
-    punpckhwd xm0, xm2, xm1
-    punpcklwd xm2, xm1
-    vinserti128 m2, m2, xm0, 1
-    pmaddwd m2, [r5 + 1 * mmsize]
-    paddd m6, m2
-    movu xm0, [r7 + r1 * 2] ; m0 = row 18
-    punpckhwd xm2, xm1, xm0
-    punpcklwd xm1, xm0
-    vinserti128 m1, m1, xm2, 1
-    pmaddwd m1, [r5 + 1 * mmsize]
-    paddd m7, m1
-
-%ifidn %1,sp
-    paddd m4, m9
-    paddd m5, m9
-    paddd m6, m9
-    paddd m7, m9
-    psrad m4, 12
-    psrad m5, 12
-    psrad m6, 12
-    psrad m7, 12
-%else
-    psrad m4, 6
-    psrad m5, 6
-    psrad m6, 6
-    psrad m7, 6
-%endif
-    packssdw m4, m5
-    packssdw m6, m7
-%ifidn %1,sp
-    packuswb m4, m6
-    vpermd m4, m3, m4
-    vextracti128 xm6, m4, 1
-    movq [r8], xm4
-    movhps [r8 + r3], xm4
-    movq [r8 + r3 * 2], xm6
-    movhps [r8 + r6], xm6
-%else
-    vpermq m4, m4, 11011000b
-    vpermq m6, m6, 11011000b
-    vextracti128 xm5, m4, 1
-    vextracti128 xm7, m6, 1
-    movu [r8], xm4
-    movu [r8 + r3], xm5
-    movu [r8 + r3 * 2], xm6
-    movu [r8 + r6], xm7
-%endif
-%endmacro
-
-%macro FILTER_VER_CHROMA_S_AVX2_Nx16 2
-INIT_YMM avx2
-%if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%1_%2x16, 4, 10, 10
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m9, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-    mov r9d, %2 / 8
-.loopW:
-    PROCESS_CHROMA_S_AVX2_W8_16R %1
-%ifidn %1,sp
-    add r2, 8
-%else
-    add r2, 16
-%endif
-    add r0, 16
-    dec r9d
-    jnz .loopW
-    RET
-%endif
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16
-    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32
-    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 64
-    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16
-    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32
-    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 64
-
-%macro FILTER_VER_CHROMA_S_AVX2_NxN 3
-INIT_YMM avx2
-%if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%3_%1x%2, 4, 11, 10
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %3,sp
-    mova m9, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-    mov r9d, %2 / 16
-.loopH:
-    mov r10d, %1 / 8
-.loopW:
-    PROCESS_CHROMA_S_AVX2_W8_16R %3
-%ifidn %3,sp
-    add r2, 8
-%else
-    add r2, 16
-%endif
-    add r0, 16
-    dec r10d
-    jnz .loopW
-    lea r0, [r7 - 2 * %1 + 16]
-%ifidn %3,sp
-    lea r2, [r8 + r3 * 4 - %1 + 8]
-%else
-    lea r2, [r8 + r3 * 4 - 2 * %1 + 16]
-%endif
-    dec r9d
-    jnz .loopH
-    RET
-%endif
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, sp
-    FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, ss
-    FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, ss
-
-%macro PROCESS_CHROMA_S_AVX2_W8_4R 1
-    movu xm0, [r0] ; m0 = row 0
-    movu xm1, [r0 + r1] ; m1 = row 1
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m0, [r5]
-    movu xm2, [r0 + r1 * 2] ; m2 = row 2
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5]
-    movu xm3, [r0 + r4] ; m3 = row 3
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m4, m2, [r5 + 1 * mmsize]
-    paddd m0, m4
-    pmaddwd m2, [r5]
-    lea r0, [r0 + r1 * 4]
-    movu xm4, [r0] ; m4 = row 4
-    punpckhwd xm5, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm5, 1
-    pmaddwd m5, m3, [r5 + 1 * mmsize]
-    paddd m1, m5
-    pmaddwd m3, [r5]
-    movu xm5, [r0 + r1] ; m5 = row 5
-    punpckhwd xm6, xm4, xm5
-    punpcklwd xm4, xm5
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m4, [r5 + 1 * mmsize]
-    paddd m2, m4
-    movu xm6, [r0 + r1 * 2] ; m6 = row 6
-    punpckhwd xm4, xm5, xm6
-    punpcklwd xm5, xm6
-    vinserti128 m5, m5, xm4, 1
-    pmaddwd m5, [r5 + 1 * mmsize]
-    paddd m3, m5
-%ifidn %1,sp
-    paddd m0, m7
-    paddd m1, m7
-    paddd m2, m7
-    paddd m3, m7
-    psrad m0, 12
-    psrad m1, 12
-    psrad m2, 12
-    psrad m3, 12
-%else
-    psrad m0, 6
-    psrad m1, 6
-    psrad m2, 6
-    psrad m3, 6
-%endif
-    packssdw m0, m1
-    packssdw m2, m3
-%ifidn %1,sp
-    packuswb m0, m2
-    mova m3, [v4_interp8_hps_shuf]
-    vpermd m0, m3, m0
-    vextracti128 xm2, m0, 1
-%else
-    vpermq m0, m0, 11011000b
-    vpermq m2, m2, 11011000b
-    vextracti128 xm1, m0, 1
-    vextracti128 xm3, m2, 1
-%endif
-%endmacro
-
-%macro FILTER_VER_CHROMA_S_AVX2_8x4 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_8x4, 4, 6, 8
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m7, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-
-    PROCESS_CHROMA_S_AVX2_W8_4R %1
-    lea r4, [r3 * 3]
-%ifidn %1,sp
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-    movq [r2 + r3 * 2], xm2
-    movhps [r2 + r4], xm2
-%else
-    movu [r2], xm0
-    movu [r2 + r3], xm1
-    movu [r2 + r3 * 2], xm2
-    movu [r2 + r4], xm3
-%endif
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_8x4 sp
-    FILTER_VER_CHROMA_S_AVX2_8x4 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_12x16 1
-INIT_YMM avx2
-%if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%1_12x16, 4, 9, 10
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m9, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-    PROCESS_CHROMA_S_AVX2_W8_16R %1
-%ifidn %1,sp
-    add r2, 8
-%else
-    add r2, 16
-%endif
-    add r0, 16
-    mova m7, m9
-    PROCESS_CHROMA_AVX2_W4_16R %1
-    RET
-%endif
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_12x16 sp
-    FILTER_VER_CHROMA_S_AVX2_12x16 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_12x32 1
-%if ARCH_X86_64 == 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_12x32, 4, 9, 10
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1, sp
-    mova m9, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-%rep 2
-    PROCESS_CHROMA_S_AVX2_W8_16R %1
-%ifidn %1, sp
-    add r2, 8
-%else
-    add r2, 16
-%endif
-    add r0, 16
-    mova m7, m9
-    PROCESS_CHROMA_AVX2_W4_16R %1
-    sub r0, 16
-%ifidn %1, sp
-    lea r2, [r2 + r3 * 4 - 8]
-%else
-    lea r2, [r2 + r3 * 4 - 16]
-%endif
-%endrep
-    RET
-%endif
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_12x32 sp
-    FILTER_VER_CHROMA_S_AVX2_12x32 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_16x12 1
-INIT_YMM avx2
-%if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%1_16x12, 4, 9, 9
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m8, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-%rep 2
-    movu xm0, [r0] ; m0 = row 0
-    movu xm1, [r0 + r1] ; m1 = row 1
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m0, [r5]
-    movu xm2, [r0 + r1 * 2] ; m2 = row 2
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5]
-    movu xm3, [r0 + r4] ; m3 = row 3
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m4, m2, [r5 + 1 * mmsize]
-    paddd m0, m4
-    pmaddwd m2, [r5]
-    lea r7, [r0 + r1 * 4]
-    movu xm4, [r7] ; m4 = row 4
-    punpckhwd xm5, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm5, 1
-    pmaddwd m5, m3, [r5 + 1 * mmsize]
-    paddd m1, m5
-    pmaddwd m3, [r5]
-%ifidn %1,sp
-    paddd m0, m8
-    paddd m1, m8
-    psrad m0, 12
-    psrad m1, 12
-%else
-    psrad m0, 6
-    psrad m1, 6
-%endif
-    packssdw m0, m1
-
-    movu xm5, [r7 + r1] ; m5 = row 5
-    punpckhwd xm6, xm4, xm5
-    punpcklwd xm4, xm5
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m6, m4, [r5 + 1 * mmsize]
-    paddd m2, m6
-    pmaddwd m4, [r5]
-    movu xm6, [r7 + r1 * 2] ; m6 = row 6
-    punpckhwd xm1, xm5, xm6
-    punpcklwd xm5, xm6
-    vinserti128 m5, m5, xm1, 1
-    pmaddwd m1, m5, [r5 + 1 * mmsize]
-    pmaddwd m5, [r5]
-    paddd m3, m1
-%ifidn %1,sp
-    paddd m2, m8
-    paddd m3, m8
-    psrad m2, 12
-    psrad m3, 12
-%else
-    psrad m2, 6
-    psrad m3, 6
-%endif
-    packssdw m2, m3
-%ifidn %1,sp
-    packuswb m0, m2
-    mova m3, [v4_interp8_hps_shuf]
-    vpermd m0, m3, m0
-    vextracti128 xm2, m0, 1
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-    movq [r2 + r3 * 2], xm2
-    movhps [r2 + r6], xm2
-%else
-    vpermq m0, m0, 11011000b
-    vpermq m2, m2, 11011000b
-    movu [r2], xm0
-    vextracti128 xm0, m0, 1
-    vextracti128 xm3, m2, 1
-    movu [r2 + r3], xm0
-    movu [r2 + r3 * 2], xm2
-    movu [r2 + r6], xm3
-%endif
-    lea r8, [r2 + r3 * 4]
-
-    movu xm1, [r7 + r4] ; m1 = row 7
-    punpckhwd xm0, xm6, xm1
-    punpcklwd xm6, xm1
-    vinserti128 m6, m6, xm0, 1
-    pmaddwd m0, m6, [r5 + 1 * mmsize]
-    pmaddwd m6, [r5]
-    paddd m4, m0
-    lea r7, [r7 + r1 * 4]
-    movu xm0, [r7] ; m0 = row 8
-    punpckhwd xm2, xm1, xm0
-    punpcklwd xm1, xm0
-    vinserti128 m1, m1, xm2, 1
-    pmaddwd m2, m1, [r5 + 1 * mmsize]
-    pmaddwd m1, [r5]
-    paddd m5, m2
-%ifidn %1,sp
-    paddd m4, m8
-    paddd m5, m8
-    psrad m4, 12
-    psrad m5, 12
-%else
-    psrad m4, 6
-    psrad m5, 6
-%endif
-    packssdw m4, m5
-
-    movu xm2, [r7 + r1] ; m2 = row 9
-    punpckhwd xm5, xm0, xm2
-    punpcklwd xm0, xm2
-    vinserti128 m0, m0, xm5, 1
-    pmaddwd m5, m0, [r5 + 1 * mmsize]
-    paddd m6, m5
-    pmaddwd m0, [r5]
-    movu xm5, [r7 + r1 * 2] ; m5 = row 10
-    punpckhwd xm7, xm2, xm5
-    punpcklwd xm2, xm5
-    vinserti128 m2, m2, xm7, 1
-    pmaddwd m7, m2, [r5 + 1 * mmsize]
-    paddd m1, m7
-    pmaddwd m2, [r5]
-
-%ifidn %1,sp
-    paddd m6, m8
-    paddd m1, m8
-    psrad m6, 12
-    psrad m1, 12
-%else
-    psrad m6, 6
-    psrad m1, 6
-%endif
-    packssdw m6, m1
-%ifidn %1,sp
-    packuswb m4, m6
-    vpermd m4, m3, m4
-    vextracti128 xm6, m4, 1
-    movq [r8], xm4
-    movhps [r8 + r3], xm4
-    movq [r8 + r3 * 2], xm6
-    movhps [r8 + r6], xm6
-%else
-    vpermq m4, m4, 11011000b
-    vpermq m6, m6, 11011000b
-    vextracti128 xm7, m4, 1
-    vextracti128 xm1, m6, 1
-    movu [r8], xm4
-    movu [r8 + r3], xm7
-    movu [r8 + r3 * 2], xm6
-    movu [r8 + r6], xm1
-%endif
-    lea r8, [r8 + r3 * 4]
-
-    movu xm7, [r7 + r4] ; m7 = row 11
-    punpckhwd xm1, xm5, xm7
-    punpcklwd xm5, xm7
-    vinserti128 m5, m5, xm1, 1
-    pmaddwd m1, m5, [r5 + 1 * mmsize]
-    paddd m0, m1
-    pmaddwd m5, [r5]
-    lea r7, [r7 + r1 * 4]
-    movu xm1, [r7] ; m1 = row 12
-    punpckhwd xm4, xm7, xm1
-    punpcklwd xm7, xm1
-    vinserti128 m7, m7, xm4, 1
-    pmaddwd m4, m7, [r5 + 1 * mmsize]
-    paddd m2, m4
-    pmaddwd m7, [r5]
-%ifidn %1,sp
-    paddd m0, m8
-    paddd m2, m8
-    psrad m0, 12
-    psrad m2, 12
-%else
-    psrad m0, 6
-    psrad m2, 6
-%endif
-    packssdw m0, m2
-
-    movu xm4, [r7 + r1] ; m4 = row 13
-    punpckhwd xm2, xm1, xm4
-    punpcklwd xm1, xm4
-    vinserti128 m1, m1, xm2, 1
-    pmaddwd m1, [r5 + 1 * mmsize]
-    paddd m5, m1
-    movu xm2, [r7 + r1 * 2] ; m2 = row 14
-    punpckhwd xm6, xm4, xm2
-    punpcklwd xm4, xm2
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m4, [r5 + 1 * mmsize]
-    paddd m7, m4
-%ifidn %1,sp
-    paddd m5, m8
-    paddd m7, m8
-    psrad m5, 12
-    psrad m7, 12
-%else
-    psrad m5, 6
-    psrad m7, 6
-%endif
-    packssdw m5, m7
-%ifidn %1,sp
-    packuswb m0, m5
-    vpermd m0, m3, m0
-    vextracti128 xm5, m0, 1
-    movq [r8], xm0
-    movhps [r8 + r3], xm0
-    movq [r8 + r3 * 2], xm5
-    movhps [r8 + r6], xm5
-    add r2, 8
-%else
-    vpermq m0, m0, 11011000b
-    vpermq m5, m5, 11011000b
-    vextracti128 xm7, m0, 1
-    vextracti128 xm6, m5, 1
-    movu [r8], xm0
-    movu [r8 + r3], xm7
-    movu [r8 + r3 * 2], xm5
-    movu [r8 + r6], xm6
-    add r2, 16
-%endif
-    add r0, 16
-%endrep
-    RET
-%endif
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_16x12 sp
-    FILTER_VER_CHROMA_S_AVX2_16x12 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_8x12 1
-%if ARCH_X86_64 == 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_8x12, 4, 7, 9
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m8, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-    movu xm0, [r0] ; m0 = row 0
-    movu xm1, [r0 + r1] ; m1 = row 1
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m0, [r5]
-    movu xm2, [r0 + r1 * 2] ; m2 = row 2
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5]
-    movu xm3, [r0 + r4] ; m3 = row 3
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m4, m2, [r5 + 1 * mmsize]
-    paddd m0, m4
-    pmaddwd m2, [r5]
-    lea r0, [r0 + r1 * 4]
-    movu xm4, [r0] ; m4 = row 4
-    punpckhwd xm5, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm5, 1
-    pmaddwd m5, m3, [r5 + 1 * mmsize]
-    paddd m1, m5
-    pmaddwd m3, [r5]
-%ifidn %1,sp
-    paddd m0, m8
-    paddd m1, m8
-    psrad m0, 12
-    psrad m1, 12
-%else
-    psrad m0, 6
-    psrad m1, 6
-%endif
-    packssdw m0, m1
-
-    movu xm5, [r0 + r1] ; m5 = row 5
-    punpckhwd xm6, xm4, xm5
-    punpcklwd xm4, xm5
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m6, m4, [r5 + 1 * mmsize]
-    paddd m2, m6
-    pmaddwd m4, [r5]
-    movu xm6, [r0 + r1 * 2] ; m6 = row 6
-    punpckhwd xm1, xm5, xm6
-    punpcklwd xm5, xm6
-    vinserti128 m5, m5, xm1, 1
-    pmaddwd m1, m5, [r5 + 1 * mmsize]
-    pmaddwd m5, [r5]
-    paddd m3, m1
-%ifidn %1,sp
-    paddd m2, m8
-    paddd m3, m8
-    psrad m2, 12
-    psrad m3, 12
-%else
-    psrad m2, 6
-    psrad m3, 6
-%endif
-    packssdw m2, m3
-%ifidn %1,sp
-    packuswb m0, m2
-    mova m3, [v4_interp8_hps_shuf]
-    vpermd m0, m3, m0
-    vextracti128 xm2, m0, 1
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-    movq [r2 + r3 * 2], xm2
-    movhps [r2 + r6], xm2
-%else
-    vpermq m0, m0, 11011000b
-    vpermq m2, m2, 11011000b
-    movu [r2], xm0
-    vextracti128 xm0, m0, 1
-    vextracti128 xm3, m2, 1
-    movu [r2 + r3], xm0
-    movu [r2 + r3 * 2], xm2
-    movu [r2 + r6], xm3
-%endif
-    lea r2, [r2 + r3 * 4]
-
-    movu xm1, [r0 + r4] ; m1 = row 7
-    punpckhwd xm0, xm6, xm1
-    punpcklwd xm6, xm1
-    vinserti128 m6, m6, xm0, 1
-    pmaddwd m0, m6, [r5 + 1 * mmsize]
-    pmaddwd m6, [r5]
-    paddd m4, m0
-    lea r0, [r0 + r1 * 4]
-    movu xm0, [r0] ; m0 = row 8
-    punpckhwd xm2, xm1, xm0
-    punpcklwd xm1, xm0
-    vinserti128 m1, m1, xm2, 1
-    pmaddwd m2, m1, [r5 + 1 * mmsize]
-    pmaddwd m1, [r5]
-    paddd m5, m2
-%ifidn %1,sp
-    paddd m4, m8
-    paddd m5, m8
-    psrad m4, 12
-    psrad m5, 12
-%else
-    psrad m4, 6
-    psrad m5, 6
-%endif
-    packssdw m4, m5
-
-    movu xm2, [r0 + r1] ; m2 = row 9
-    punpckhwd xm5, xm0, xm2
-    punpcklwd xm0, xm2
-    vinserti128 m0, m0, xm5, 1
-    pmaddwd m5, m0, [r5 + 1 * mmsize]
-    paddd m6, m5
-    pmaddwd m0, [r5]
-    movu xm5, [r0 + r1 * 2] ; m5 = row 10
-    punpckhwd xm7, xm2, xm5
-    punpcklwd xm2, xm5
-    vinserti128 m2, m2, xm7, 1
-    pmaddwd m7, m2, [r5 + 1 * mmsize]
-    paddd m1, m7
-    pmaddwd m2, [r5]
-
-%ifidn %1,sp
-    paddd m6, m8
-    paddd m1, m8
-    psrad m6, 12
-    psrad m1, 12
-%else
-    psrad m6, 6
-    psrad m1, 6
-%endif
-    packssdw m6, m1
-%ifidn %1,sp
-    packuswb m4, m6
-    vpermd m4, m3, m4
-    vextracti128 xm6, m4, 1
-    movq [r2], xm4
-    movhps [r2 + r3], xm4
-    movq [r2 + r3 * 2], xm6
-    movhps [r2 + r6], xm6
-%else
-    vpermq m4, m4, 11011000b
-    vpermq m6, m6, 11011000b
-    vextracti128 xm7, m4, 1
-    vextracti128 xm1, m6, 1
-    movu [r2], xm4
-    movu [r2 + r3], xm7
-    movu [r2 + r3 * 2], xm6
-    movu [r2 + r6], xm1
-%endif
-    lea r2, [r2 + r3 * 4]
-
-    movu xm7, [r0 + r4] ; m7 = row 11
-    punpckhwd xm1, xm5, xm7
-    punpcklwd xm5, xm7
-    vinserti128 m5, m5, xm1, 1
-    pmaddwd m1, m5, [r5 + 1 * mmsize]
-    paddd m0, m1
-    pmaddwd m5, [r5]
-    lea r0, [r0 + r1 * 4]
-    movu xm1, [r0] ; m1 = row 12
-    punpckhwd xm4, xm7, xm1
-    punpcklwd xm7, xm1
-    vinserti128 m7, m7, xm4, 1
-    pmaddwd m4, m7, [r5 + 1 * mmsize]
-    paddd m2, m4
-    pmaddwd m7, [r5]
-%ifidn %1,sp
-    paddd m0, m8
-    paddd m2, m8
-    psrad m0, 12
-    psrad m2, 12
-%else
-    psrad m0, 6
-    psrad m2, 6
-%endif
-    packssdw m0, m2
-
-    movu xm4, [r0 + r1] ; m4 = row 13
-    punpckhwd xm2, xm1, xm4
-    punpcklwd xm1, xm4
-    vinserti128 m1, m1, xm2, 1
-    pmaddwd m1, [r5 + 1 * mmsize]
-    paddd m5, m1
-    movu xm2, [r0 + r1 * 2] ; m2 = row 14
-    punpckhwd xm6, xm4, xm2
-    punpcklwd xm4, xm2
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m4, [r5 + 1 * mmsize]
-    paddd m7, m4
-%ifidn %1,sp
-    paddd m5, m8
-    paddd m7, m8
-    psrad m5, 12
-    psrad m7, 12
-%else
-    psrad m5, 6
-    psrad m7, 6
-%endif
-    packssdw m5, m7
-%ifidn %1,sp
-    packuswb m0, m5
-    vpermd m0, m3, m0
-    vextracti128 xm5, m0, 1
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-    movq [r2 + r3 * 2], xm5
-    movhps [r2 + r6], xm5
-%else
-    vpermq m0, m0, 11011000b
-    vpermq m5, m5, 11011000b
-    vextracti128 xm7, m0, 1
-    vextracti128 xm6, m5, 1
-    movu [r2], xm0
-    movu [r2 + r3], xm7
-    movu [r2 + r3 * 2], xm5
-    movu [r2 + r6], xm6
-%endif
-    RET
-%endif
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_8x12 sp
-    FILTER_VER_CHROMA_S_AVX2_8x12 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_16x4 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_16x4, 4, 7, 8
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m7, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-%rep 2
-    PROCESS_CHROMA_S_AVX2_W8_4R %1
-    lea r6, [r3 * 3]
-%ifidn %1,sp
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-    movq [r2 + r3 * 2], xm2
-    movhps [r2 + r6], xm2
-    add r2, 8
-%else
-    movu [r2], xm0
-    movu [r2 + r3], xm1
-    movu [r2 + r3 * 2], xm2
-    movu [r2 + r6], xm3
-    add r2, 16
-%endif
-    lea r6, [4 * r1 - 16]
-    sub r0, r6
-%endrep
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_16x4 sp
-    FILTER_VER_CHROMA_S_AVX2_16x4 ss
-
-%macro PROCESS_CHROMA_S_AVX2_W8_8R 1
-    movu xm0, [r0] ; m0 = row 0
-    movu xm1, [r0 + r1] ; m1 = row 1
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m0, [r5]
-    movu xm2, [r0 + r1 * 2] ; m2 = row 2
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5]
-    movu xm3, [r0 + r4] ; m3 = row 3
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m4, m2, [r5 + 1 * mmsize]
-    paddd m0, m4
-    pmaddwd m2, [r5]
-    lea r7, [r0 + r1 * 4]
-    movu xm4, [r7] ; m4 = row 4
-    punpckhwd xm5, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm5, 1
-    pmaddwd m5, m3, [r5 + 1 * mmsize]
-    paddd m1, m5
-    pmaddwd m3, [r5]
-%ifidn %1,sp
-    paddd m0, m7
-    paddd m1, m7
-    psrad m0, 12
-    psrad m1, 12
-%else
-    psrad m0, 6
-    psrad m1, 6
-%endif
-    packssdw m0, m1
-
-    movu xm5, [r7 + r1] ; m5 = row 5
-    punpckhwd xm6, xm4, xm5
-    punpcklwd xm4, xm5
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m6, m4, [r5 + 1 * mmsize]
-    paddd m2, m6
-    pmaddwd m4, [r5]
-    movu xm6, [r7 + r1 * 2] ; m6 = row 6
-    punpckhwd xm1, xm5, xm6
-    punpcklwd xm5, xm6
-    vinserti128 m5, m5, xm1, 1
-    pmaddwd m1, m5, [r5 + 1 * mmsize]
-    pmaddwd m5, [r5]
-    paddd m3, m1
-%ifidn %1,sp
-    paddd m2, m7
-    paddd m3, m7
-    psrad m2, 12
-    psrad m3, 12
-%else
-    psrad m2, 6
-    psrad m3, 6
-%endif
-    packssdw m2, m3
-%ifidn %1,sp
-    packuswb m0, m2
-    mova m3, [v4_interp8_hps_shuf]
-    vpermd m0, m3, m0
-    vextracti128 xm2, m0, 1
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-    movq [r2 + r3 * 2], xm2
-    movhps [r2 + r6], xm2
-%else
-    vpermq m0, m0, 11011000b
-    vpermq m2, m2, 11011000b
-    movu [r2], xm0
-    vextracti128 xm0, m0, 1
-    vextracti128 xm3, m2, 1
-    movu [r2 + r3], xm0
-    movu [r2 + r3 * 2], xm2
-    movu [r2 + r6], xm3
-%endif
-    lea r8, [r2 + r3 * 4]
-
-    movu xm1, [r7 + r4] ; m1 = row 7
-    punpckhwd xm0, xm6, xm1
-    punpcklwd xm6, xm1
-    vinserti128 m6, m6, xm0, 1
-    pmaddwd m0, m6, [r5 + 1 * mmsize]
-    pmaddwd m6, [r5]
-    paddd m4, m0
-    lea r7, [r7 + r1 * 4]
-    movu xm0, [r7] ; m0 = row 8
-    punpckhwd xm2, xm1, xm0
-    punpcklwd xm1, xm0
-    vinserti128 m1, m1, xm2, 1
-    pmaddwd m2, m1, [r5 + 1 * mmsize]
-    pmaddwd m1, [r5]
-    paddd m5, m2
-%ifidn %1,sp
-    paddd m4, m7
-    paddd m5, m7
-    psrad m4, 12
-    psrad m5, 12
-%else
-    psrad m4, 6
-    psrad m5, 6
-%endif
-    packssdw m4, m5
-
-    movu xm2, [r7 + r1] ; m2 = row 9
-    punpckhwd xm5, xm0, xm2
-    punpcklwd xm0, xm2
-    vinserti128 m0, m0, xm5, 1
-    pmaddwd m0, [r5 + 1 * mmsize]
-    paddd m6, m0
-    movu xm5, [r7 + r1 * 2] ; m5 = row 10
-    punpckhwd xm0, xm2, xm5
-    punpcklwd xm2, xm5
-    vinserti128 m2, m2, xm0, 1
-    pmaddwd m2, [r5 + 1 * mmsize]
-    paddd m1, m2
-
-%ifidn %1,sp
-    paddd m6, m7
-    paddd m1, m7
-    psrad m6, 12
-    psrad m1, 12
-%else
-    psrad m6, 6
-    psrad m1, 6
-%endif
-    packssdw m6, m1
-%ifidn %1,sp
-    packuswb m4, m6
-    vpermd m4, m3, m4
-    vextracti128 xm6, m4, 1
-    movq [r8], xm4
-    movhps [r8 + r3], xm4
-    movq [r8 + r3 * 2], xm6
-    movhps [r8 + r6], xm6
-%else
-    vpermq m4, m4, 11011000b
-    vpermq m6, m6, 11011000b
-    vextracti128 xm7, m4, 1
-    vextracti128 xm1, m6, 1
-    movu [r8], xm4
-    movu [r8 + r3], xm7
-    movu [r8 + r3 * 2], xm6
-    movu [r8 + r6], xm1
-%endif
-%endmacro
-
-%macro FILTER_VER_CHROMA_S_AVX2_Nx8 2
-INIT_YMM avx2
-%if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%1_%2x8, 4, 9, 8
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m7, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-    lea r6, [r3 * 3]
-%rep %2 / 8
-    PROCESS_CHROMA_S_AVX2_W8_8R %1
-%ifidn %1,sp
-    add r2, 8
-%else
-    add r2, 16
-%endif
-    add r0, 16
-%endrep
-    RET
-%endif
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32
-    FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16
-    FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32
-    FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16
-
-%macro FILTER_VER_CHROMA_S_AVX2_8x2 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_8x2, 4, 6, 6
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m5, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-
-    movu xm0, [r0] ; m0 = row 0
-    movu xm1, [r0 + r1] ; m1 = row 1
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m0, [r5]
-    movu xm2, [r0 + r1 * 2] ; m2 = row 2
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5]
-    movu xm3, [r0 + r4] ; m3 = row 3
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m2, [r5 + 1 * mmsize]
-    paddd m0, m2
-    movu xm4, [r0 + r1 * 4] ; m4 = row 4
-    punpckhwd xm2, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm2, 1
-    pmaddwd m3, [r5 + 1 * mmsize]
-    paddd m1, m3
-%ifidn %1,sp
-    paddd m0, m5
-    paddd m1, m5
-    psrad m0, 12
-    psrad m1, 12
-%else
-    psrad m0, 6
-    psrad m1, 6
-%endif
-    packssdw m0, m1
-%ifidn %1,sp
-    vextracti128 xm1, m0, 1
-    packuswb xm0, xm1
-    pshufd xm0, xm0, 11011000b
-    movq [r2], xm0
-    movhps [r2 + r3], xm0
-%else
-    vpermq m0, m0, 11011000b
-    vextracti128 xm1, m0, 1
-    movu [r2], xm0
-    movu [r2 + r3], xm1
-%endif
-    RET
-%endmacro
-
-    FILTER_VER_CHROMA_S_AVX2_8x2 sp
-    FILTER_VER_CHROMA_S_AVX2_8x2 ss
-
-%macro FILTER_VER_CHROMA_S_AVX2_8x6 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_8x6, 4, 6, 8
-    mov r4d, r4m
-    shl r4d, 6
-    add r1d, r1d
-
-%ifdef PIC
-    lea r5, [pw_ChromaCoeffV]
-    add r5, r4
-%else
-    lea r5, [pw_ChromaCoeffV + r4]
-%endif
-
-    lea r4, [r1 * 3]
-    sub r0, r1
-%ifidn %1,sp
-    mova m7, [v4_pd_526336]
-%else
-    add r3d, r3d
-%endif
-
-    movu xm0, [r0] ; m0 = row 0
-    movu xm1, [r0 + r1] ; m1 = row 1
-    punpckhwd xm2, xm0, xm1
-    punpcklwd xm0, xm1
-    vinserti128 m0, m0, xm2, 1
-    pmaddwd m0, [r5]
-    movu xm2, [r0 + r1 * 2] ; m2 = row 2
-    punpckhwd xm3, xm1, xm2
-    punpcklwd xm1, xm2
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5]
-    movu xm3, [r0 + r4] ; m3 = row 3
-    punpckhwd xm4, xm2, xm3
-    punpcklwd xm2, xm3
-    vinserti128 m2, m2, xm4, 1
-    pmaddwd m4, m2, [r5 + 1 * mmsize]
-    pmaddwd m2, [r5]
-    paddd m0, m4
-    lea r0, [r0 + r1 * 4]
-    movu xm4, [r0] ; m4 = row 4
-    punpckhwd xm5, xm3, xm4
-    punpcklwd xm3, xm4
-    vinserti128 m3, m3, xm5, 1
-    pmaddwd m5, m3, [r5 + 1 * mmsize]
-    pmaddwd m3, [r5]
-    paddd m1, m5
-%ifidn %1,sp
-    paddd m0, m7
-    paddd m1, m7
-    psrad m0, 12
-    psrad m1, 12
-%else
-    psrad m0, 6
-    psrad m1, 6
-%endif
-    packssdw m0, m1
-
-    movu xm5, [r0 + r1] ; m5 = row 5
-    punpckhwd xm6, xm4, xm5
-    punpcklwd xm4, xm5
-    vinserti128 m4, m4, xm6, 1
-    pmaddwd m6, m4, [r5 + 1 * mmsize]
-    paddd m2, m6
-    pmaddwd m4, [r5]
-    movu xm6, [r0 + r1 * 2] ; m6 = row 6
-    punpckhwd xm1, xm5, xm6
-    punpcklwd xm5, xm6
-    vinserti128 m5, m5, xm1, 1
-    pmaddwd m1, m5, [r5 + 1 * mmsize]
-    pmaddwd m5, [r5]
-    paddd m3, m1
-%ifidn %1,sp
-    paddd m2, m7
-    paddd m3, m7
-    psrad m2, 12
-    psrad m3, 12
-%else
-    psrad m2, 6
-    psrad m3, 6
-%endif
-    packssdw m2, m3
-
-    movu xm1, [r0 + r4] ; m1 = row 7
-    punpckhwd xm3, xm6, xm1
-    punpcklwd xm6, xm1
-    vinserti128 m6, m6, xm3, 1
-    pmaddwd m6, [r5 + 1 * mmsize]
-    paddd m4, m6
-    movu xm6, [r0 + r1 * 4] ; m6 = row 8
-    punpckhwd xm3, xm1, xm6
-    punpcklwd xm1, xm6
-    vinserti128 m1, m1, xm3, 1
-    pmaddwd m1, [r5 + 1 * mmsize]
-    paddd m5, m1
-%ifidn %1,sp
-    paddd m4, m7
-    paddd m5, m7
-    psrad m4, 12
-    psrad m5, 12
-%else
-    psrad m4, 6
-    psrad m5,
6 -%endif - packssdw m4, m5 - lea r4, [r3 * 3] -%ifidn %1,sp - packuswb m0, m2 - mova m3, [v4_interp8_hps_shuf] - vpermd m0, m3, m0 - vextracti128 xm2, m0, 1 - vextracti128 xm5, m4, 1 - packuswb xm4, xm5 - pshufd xm4, xm4, 11011000b - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r4], xm2 - lea r2, [r2 + r3 * 4] - movq [r2], xm4 - movhps [r2 + r3], xm4 -%else - vpermq m0, m0, 11011000b - vpermq m2, m2, 11011000b - vpermq m4, m4, 11011000b - movu [r2], xm0 - vextracti128 xm0, m0, 1 - vextracti128 xm3, m2, 1 - vextracti128 xm5, m4, 1 - movu [r2 + r3], xm0 - movu [r2 + r3 * 2], xm2 - movu [r2 + r4], xm3 - lea r2, [r2 + r3 * 4] - movu [r2], xm4 - movu [r2 + r3], xm5 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_S_AVX2_8x6 sp - FILTER_VER_CHROMA_S_AVX2_8x6 ss - -%macro FILTER_VER_CHROMA_S_AVX2_8xN 2 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_8x%2, 4, 7, 9 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - -%ifdef PIC - lea r5, [pw_ChromaCoeffV] - add r5, r4 -%else - lea r5, [pw_ChromaCoeffV + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,sp - mova m8, [v4_pd_526336] -%else - add r3d, r3d -%endif - lea r6, [r3 * 3] -%rep %2 / 16 - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - paddd m0, m4 - pmaddwd m2, [r5] - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - paddd m1, m5 - pmaddwd m3, [r5] -%ifidn %1,sp - paddd m0, m8 - paddd m1, m8 - psrad m0, 12 - psrad m1, 12 -%else - psrad m0, 6 - psrad m1, 6 -%endif - 
packssdw m0, m1 - - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - paddd m2, m6 - pmaddwd m4, [r5] - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhwd xm1, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm1, 1 - pmaddwd m1, m5, [r5 + 1 * mmsize] - pmaddwd m5, [r5] - paddd m3, m1 -%ifidn %1,sp - paddd m2, m8 - paddd m3, m8 - psrad m2, 12 - psrad m3, 12 -%else - psrad m2, 6 - psrad m3, 6 -%endif - packssdw m2, m3 -%ifidn %1,sp - packuswb m0, m2 - mova m3, [v4_interp8_hps_shuf] - vpermd m0, m3, m0 - vextracti128 xm2, m0, 1 - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r6], xm2 -%else - vpermq m0, m0, 11011000b - vpermq m2, m2, 11011000b - movu [r2], xm0 - vextracti128 xm0, m0, 1 - vextracti128 xm3, m2, 1 - movu [r2 + r3], xm0 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm3 -%endif - lea r2, [r2 + r3 * 4] - - movu xm1, [r0 + r4] ; m1 = row 7 - punpckhwd xm0, xm6, xm1 - punpcklwd xm6, xm1 - vinserti128 m6, m6, xm0, 1 - pmaddwd m0, m6, [r5 + 1 * mmsize] - pmaddwd m6, [r5] - paddd m4, m0 - lea r0, [r0 + r1 * 4] - movu xm0, [r0] ; m0 = row 8 - punpckhwd xm2, xm1, xm0 - punpcklwd xm1, xm0 - vinserti128 m1, m1, xm2, 1 - pmaddwd m2, m1, [r5 + 1 * mmsize] - pmaddwd m1, [r5] - paddd m5, m2 -%ifidn %1,sp - paddd m4, m8 - paddd m5, m8 - psrad m4, 12 - psrad m5, 12 -%else - psrad m4, 6 - psrad m5, 6 -%endif - packssdw m4, m5 - - movu xm2, [r0 + r1] ; m2 = row 9 - punpckhwd xm5, xm0, xm2 - punpcklwd xm0, xm2 - vinserti128 m0, m0, xm5, 1 - pmaddwd m5, m0, [r5 + 1 * mmsize] - paddd m6, m5 - pmaddwd m0, [r5] - movu xm5, [r0 + r1 * 2] ; m5 = row 10 - punpckhwd xm7, xm2, xm5 - punpcklwd xm2, xm5 - vinserti128 m2, m2, xm7, 1 - pmaddwd m7, m2, [r5 + 1 * mmsize] - paddd m1, m7 - pmaddwd m2, [r5] - -%ifidn %1,sp - paddd m6, m8 - paddd m1, m8 - psrad m6, 12 - psrad m1, 12 -%else - psrad m6, 6 - psrad m1, 6 -%endif - packssdw m6, m1 -%ifidn %1,sp - 
packuswb m4, m6 - vpermd m4, m3, m4 - vextracti128 xm6, m4, 1 - movq [r2], xm4 - movhps [r2 + r3], xm4 - movq [r2 + r3 * 2], xm6 - movhps [r2 + r6], xm6 -%else - vpermq m4, m4, 11011000b - vpermq m6, m6, 11011000b - vextracti128 xm7, m4, 1 - vextracti128 xm1, m6, 1 - movu [r2], xm4 - movu [r2 + r3], xm7 - movu [r2 + r3 * 2], xm6 - movu [r2 + r6], xm1 -%endif - lea r2, [r2 + r3 * 4] - - movu xm7, [r0 + r4] ; m7 = row 11 - punpckhwd xm1, xm5, xm7 - punpcklwd xm5, xm7 - vinserti128 m5, m5, xm1, 1 - pmaddwd m1, m5, [r5 + 1 * mmsize] - paddd m0, m1 - pmaddwd m5, [r5] - lea r0, [r0 + r1 * 4] - movu xm1, [r0] ; m1 = row 12 - punpckhwd xm4, xm7, xm1 - punpcklwd xm7, xm1 - vinserti128 m7, m7, xm4, 1 - pmaddwd m4, m7, [r5 + 1 * mmsize] - paddd m2, m4 - pmaddwd m7, [r5] -%ifidn %1,sp - paddd m0, m8 - paddd m2, m8 - psrad m0, 12 - psrad m2, 12 -%else - psrad m0, 6 - psrad m2, 6 -%endif - packssdw m0, m2 - - movu xm4, [r0 + r1] ; m4 = row 13 - punpckhwd xm2, xm1, xm4 - punpcklwd xm1, xm4 - vinserti128 m1, m1, xm2, 1 - pmaddwd m2, m1, [r5 + 1 * mmsize] - paddd m5, m2 - pmaddwd m1, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 14 - punpckhwd xm6, xm4, xm2 - punpcklwd xm4, xm2 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - paddd m7, m6 - pmaddwd m4, [r5] -%ifidn %1,sp - paddd m5, m8 - paddd m7, m8 - psrad m5, 12 - psrad m7, 12 -%else - psrad m5, 6 - psrad m7, 6 -%endif - packssdw m5, m7 -%ifidn %1,sp - packuswb m0, m5 - vpermd m0, m3, m0 - vextracti128 xm5, m0, 1 - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm5 - movhps [r2 + r6], xm5 -%else - vpermq m0, m0, 11011000b - vpermq m5, m5, 11011000b - vextracti128 xm7, m0, 1 - vextracti128 xm6, m5, 1 - movu [r2], xm0 - movu [r2 + r3], xm7 - movu [r2 + r3 * 2], xm5 - movu [r2 + r6], xm6 -%endif - lea r2, [r2 + r3 * 4] - - movu xm6, [r0 + r4] ; m6 = row 15 - punpckhwd xm5, xm2, xm6 - punpcklwd xm2, xm6 - vinserti128 m2, m2, xm5, 1 - pmaddwd m5, m2, [r5 + 1 * mmsize] - paddd m1, m5 - pmaddwd m2, [r5] - 
lea r0, [r0 + r1 * 4] - movu xm0, [r0] ; m0 = row 16 - punpckhwd xm5, xm6, xm0 - punpcklwd xm6, xm0 - vinserti128 m6, m6, xm5, 1 - pmaddwd m5, m6, [r5 + 1 * mmsize] - paddd m4, m5 - pmaddwd m6, [r5] -%ifidn %1,sp - paddd m1, m8 - paddd m4, m8 - psrad m1, 12 - psrad m4, 12 -%else - psrad m1, 6 - psrad m4, 6 -%endif - packssdw m1, m4 - - movu xm5, [r0 + r1] ; m5 = row 17 - punpckhwd xm4, xm0, xm5 - punpcklwd xm0, xm5 - vinserti128 m0, m0, xm4, 1 - pmaddwd m0, [r5 + 1 * mmsize] - paddd m2, m0 - movu xm4, [r0 + r1 * 2] ; m4 = row 18 - punpckhwd xm0, xm5, xm4 - punpcklwd xm5, xm4 - vinserti128 m5, m5, xm0, 1 - pmaddwd m5, [r5 + 1 * mmsize] - paddd m6, m5 -%ifidn %1,sp - paddd m2, m8 - paddd m6, m8 - psrad m2, 12 - psrad m6, 12 -%else - psrad m2, 6 - psrad m6, 6 -%endif - packssdw m2, m6 -%ifidn %1,sp - packuswb m1, m2 - vpermd m1, m3, m1 - vextracti128 xm2, m1, 1 - movq [r2], xm1 - movhps [r2 + r3], xm1 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r6], xm2 -%else - vpermq m1, m1, 11011000b - vpermq m2, m2, 11011000b - vextracti128 xm6, m1, 1 - vextracti128 xm4, m2, 1 - movu [r2], xm1 - movu [r2 + r3], xm6 - movu [r2 + r3 * 2], xm2 - movu [r2 + r6], xm4 -%endif - lea r2, [r2 + r3 * 4] -%endrep - RET -%endif -%endmacro - - FILTER_VER_CHROMA_S_AVX2_8xN sp, 16 - FILTER_VER_CHROMA_S_AVX2_8xN sp, 32 - FILTER_VER_CHROMA_S_AVX2_8xN sp, 64 - FILTER_VER_CHROMA_S_AVX2_8xN ss, 16 - FILTER_VER_CHROMA_S_AVX2_8xN ss, 32 - FILTER_VER_CHROMA_S_AVX2_8xN ss, 64 - -%macro FILTER_VER_CHROMA_S_AVX2_Nx24 2 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_%2x24, 4, 10, 10 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - -%ifdef PIC - lea r5, [pw_ChromaCoeffV] - add r5, r4 -%else - lea r5, [pw_ChromaCoeffV + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,sp - mova m9, [v4_pd_526336] -%else - add r3d, r3d -%endif - lea r6, [r3 * 3] - mov r9d, %2 / 8 -.loopW: - PROCESS_CHROMA_S_AVX2_W8_16R %1 -%ifidn %1,sp - add r2, 8 -%else - add r2, 16 -%endif - add r0, 16 - dec r9d - jnz 
.loopW -%ifidn %1,sp - lea r2, [r8 + r3 * 4 - %2 + 8] -%else - lea r2, [r8 + r3 * 4 - 2 * %2 + 16] -%endif - lea r0, [r7 - 2 * %2 + 16] - mova m7, m9 - mov r9d, %2 / 8 -.loop: - PROCESS_CHROMA_S_AVX2_W8_8R %1 -%ifidn %1,sp - add r2, 8 -%else - add r2, 16 -%endif - add r0, 16 - dec r9d - jnz .loop - RET -%endif -%endmacro - - FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 32 - FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 16 - FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 32 - FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 16 - -%macro FILTER_VER_CHROMA_S_AVX2_2x8 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_2x8, 4, 6, 7 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - sub r0, r1 - -%ifdef PIC - lea r5, [pw_ChromaCoeffV] - add r5, r4 -%else - lea r5, [pw_ChromaCoeffV + r4] -%endif - - lea r4, [r1 * 3] -%ifidn %1,sp - mova m6, [v4_pd_526336] -%else - add r3d, r3d -%endif - movd xm0, [r0] - movd xm1, [r0 + r1] - punpcklwd xm0, xm1 - movd xm2, [r0 + r1 * 2] - punpcklwd xm1, xm2 - punpcklqdq xm0, xm1 ; m0 = [2 1 1 0] - movd xm3, [r0 + r4] - punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movd xm4, [r0] - punpcklwd xm3, xm4 - punpcklqdq xm2, xm3 ; m2 = [4 3 3 2] - vinserti128 m0, m0, xm2, 1 ; m0 = [4 3 3 2 2 1 1 0] - movd xm1, [r0 + r1] - punpcklwd xm4, xm1 - movd xm3, [r0 + r1 * 2] - punpcklwd xm1, xm3 - punpcklqdq xm4, xm1 ; m4 = [6 5 5 4] - vinserti128 m2, m2, xm4, 1 ; m2 = [6 5 5 4 4 3 3 2] - pmaddwd m0, [r5] - pmaddwd m2, [r5 + 1 * mmsize] - paddd m0, m2 - movd xm1, [r0 + r4] - punpcklwd xm3, xm1 - lea r0, [r0 + 4 * r1] - movd xm2, [r0] - punpcklwd xm1, xm2 - punpcklqdq xm3, xm1 ; m3 = [8 7 7 6] - vinserti128 m4, m4, xm3, 1 ; m4 = [8 7 7 6 6 5 5 4] - movd xm1, [r0 + r1] - punpcklwd xm2, xm1 - movd xm5, [r0 + r1 * 2] - punpcklwd xm1, xm5 - punpcklqdq xm2, xm1 ; m2 = [10 9 9 8] - vinserti128 m3, m3, xm2, 1 ; m3 = [10 9 9 8 8 7 7 6] - pmaddwd m4, [r5] - pmaddwd m3, [r5 + 1 * mmsize] - paddd m4, m3 -%ifidn %1,sp - paddd m0, m6 - paddd m4, m6 - psrad m0, 12 - psrad m4, 12 -%else - psrad m0, 6 - psrad m4, 6 -%endif - packssdw 
m0, m4 - vextracti128 xm4, m0, 1 - lea r4, [r3 * 3] -%ifidn %1,sp - packuswb xm0, xm4 - pextrw [r2], xm0, 0 - pextrw [r2 + r3], xm0, 1 - pextrw [r2 + 2 * r3], xm0, 4 - pextrw [r2 + r4], xm0, 5 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm0, 2 - pextrw [r2 + r3], xm0, 3 - pextrw [r2 + 2 * r3], xm0, 6 - pextrw [r2 + r4], xm0, 7 -%else - movd [r2], xm0 - pextrd [r2 + r3], xm0, 1 - movd [r2 + 2 * r3], xm4 - pextrd [r2 + r4], xm4, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm0, 2 - pextrd [r2 + r3], xm0, 3 - pextrd [r2 + 2 * r3], xm4, 2 - pextrd [r2 + r4], xm4, 3 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_S_AVX2_2x8 sp - FILTER_VER_CHROMA_S_AVX2_2x8 ss - -%macro FILTER_VER_CHROMA_S_AVX2_2x16 1 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_2x16, 4, 6, 9 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - sub r0, r1 - -%ifdef PIC - lea r5, [pw_ChromaCoeffV] - add r5, r4 -%else - lea r5, [pw_ChromaCoeffV + r4] -%endif - - lea r4, [r1 * 3] -%ifidn %1,sp - mova m6, [v4_pd_526336] -%else - add r3d, r3d -%endif - movd xm0, [r0] - movd xm1, [r0 + r1] - punpcklwd xm0, xm1 - movd xm2, [r0 + r1 * 2] - punpcklwd xm1, xm2 - punpcklqdq xm0, xm1 ; m0 = [2 1 1 0] - movd xm3, [r0 + r4] - punpcklwd xm2, xm3 - lea r0, [r0 + 4 * r1] - movd xm4, [r0] - punpcklwd xm3, xm4 - punpcklqdq xm2, xm3 ; m2 = [4 3 3 2] - vinserti128 m0, m0, xm2, 1 ; m0 = [4 3 3 2 2 1 1 0] - movd xm1, [r0 + r1] - punpcklwd xm4, xm1 - movd xm3, [r0 + r1 * 2] - punpcklwd xm1, xm3 - punpcklqdq xm4, xm1 ; m4 = [6 5 5 4] - vinserti128 m2, m2, xm4, 1 ; m2 = [6 5 5 4 4 3 3 2] - pmaddwd m0, [r5] - pmaddwd m2, [r5 + 1 * mmsize] - paddd m0, m2 - movd xm1, [r0 + r4] - punpcklwd xm3, xm1 - lea r0, [r0 + 4 * r1] - movd xm2, [r0] - punpcklwd xm1, xm2 - punpcklqdq xm3, xm1 ; m3 = [8 7 7 6] - vinserti128 m4, m4, xm3, 1 ; m4 = [8 7 7 6 6 5 5 4] - movd xm1, [r0 + r1] - punpcklwd xm2, xm1 - movd xm5, [r0 + r1 * 2] - punpcklwd xm1, xm5 - punpcklqdq xm2, xm1 ; m2 = [10 9 9 8] - vinserti128 m3, m3, xm2, 1 ; m3 = [10 9 9 8 8 7 7 
6] - pmaddwd m4, [r5] - pmaddwd m3, [r5 + 1 * mmsize] - paddd m4, m3 - movd xm1, [r0 + r4] - punpcklwd xm5, xm1 - lea r0, [r0 + 4 * r1] - movd xm3, [r0] - punpcklwd xm1, xm3 - punpcklqdq xm5, xm1 ; m5 = [12 11 11 10] - vinserti128 m2, m2, xm5, 1 ; m2 = [12 11 11 10 10 9 9 8] - movd xm1, [r0 + r1] - punpcklwd xm3, xm1 - movd xm7, [r0 + r1 * 2] - punpcklwd xm1, xm7 - punpcklqdq xm3, xm1 ; m3 = [14 13 13 12] - vinserti128 m5, m5, xm3, 1 ; m5 = [14 13 13 12 12 11 11 10] - pmaddwd m2, [r5] - pmaddwd m5, [r5 + 1 * mmsize] - paddd m2, m5 - movd xm5, [r0 + r4] - punpcklwd xm7, xm5 - lea r0, [r0 + 4 * r1] - movd xm1, [r0] - punpcklwd xm5, xm1 - punpcklqdq xm7, xm5 ; m7 = [16 15 15 14] - vinserti128 m3, m3, xm7, 1 ; m3 = [16 15 15 14 14 13 13 12] - movd xm5, [r0 + r1] - punpcklwd xm1, xm5 - movd xm8, [r0 + r1 * 2] - punpcklwd xm5, xm8 - punpcklqdq xm1, xm5 ; m1 = [18 17 17 16] - vinserti128 m7, m7, xm1, 1 ; m7 = [18 17 17 16 16 15 15 14] - pmaddwd m3, [r5] - pmaddwd m7, [r5 + 1 * mmsize] - paddd m3, m7 -%ifidn %1,sp - paddd m0, m6 - paddd m4, m6 - paddd m2, m6 - paddd m3, m6 - psrad m0, 12 - psrad m4, 12 - psrad m2, 12 - psrad m3, 12 -%else - psrad m0, 6 - psrad m4, 6 - psrad m2, 6 - psrad m3, 6 -%endif - packssdw m0, m4 - packssdw m2, m3 - lea r4, [r3 * 3] -%ifidn %1,sp - packuswb m0, m2 - vextracti128 xm2, m0, 1 - pextrw [r2], xm0, 0 - pextrw [r2 + r3], xm0, 1 - pextrw [r2 + 2 * r3], xm2, 0 - pextrw [r2 + r4], xm2, 1 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm0, 2 - pextrw [r2 + r3], xm0, 3 - pextrw [r2 + 2 * r3], xm2, 2 - pextrw [r2 + r4], xm2, 3 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm0, 4 - pextrw [r2 + r3], xm0, 5 - pextrw [r2 + 2 * r3], xm2, 4 - pextrw [r2 + r4], xm2, 5 - lea r2, [r2 + r3 * 4] - pextrw [r2], xm0, 6 - pextrw [r2 + r3], xm0, 7 - pextrw [r2 + 2 * r3], xm2, 6 - pextrw [r2 + r4], xm2, 7 -%else - vextracti128 xm4, m0, 1 - vextracti128 xm3, m2, 1 - movd [r2], xm0 - pextrd [r2 + r3], xm0, 1 - movd [r2 + 2 * r3], xm4 - pextrd [r2 + r4], xm4, 1 - lea r2, [r2 + 
r3 * 4] - pextrd [r2], xm0, 2 - pextrd [r2 + r3], xm0, 3 - pextrd [r2 + 2 * r3], xm4, 2 - pextrd [r2 + r4], xm4, 3 - lea r2, [r2 + r3 * 4] - movd [r2], xm2 - pextrd [r2 + r3], xm2, 1 - movd [r2 + 2 * r3], xm3 - pextrd [r2 + r4], xm3, 1 - lea r2, [r2 + r3 * 4] - pextrd [r2], xm2, 2 - pextrd [r2 + r3], xm2, 3 - pextrd [r2 + 2 * r3], xm3, 2 - pextrd [r2 + r4], xm3, 3 -%endif - RET -%endif -%endmacro - - FILTER_VER_CHROMA_S_AVX2_2x16 sp - FILTER_VER_CHROMA_S_AVX2_2x16 ss - -%macro FILTER_VER_CHROMA_S_AVX2_6x8 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_6x8, 4, 6, 8 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - -%ifdef PIC - lea r5, [pw_ChromaCoeffV] - add r5, r4 -%else - lea r5, [pw_ChromaCoeffV + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,sp - mova m7, [v4_pd_526336] -%else - add r3d, r3d -%endif - - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - pmaddwd m2, [r5] - paddd m0, m4 - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - pmaddwd m3, [r5] - paddd m1, m5 -%ifidn %1,sp - paddd m0, m7 - paddd m1, m7 - psrad m0, 12 - psrad m1, 12 -%else - psrad m0, 6 - psrad m1, 6 -%endif - packssdw m0, m1 - - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - paddd m2, m6 - pmaddwd m4, [r5] - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhwd xm1, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm1, 1 - pmaddwd m1, m5, [r5 + 1 * mmsize] - pmaddwd m5, [r5] - paddd m3, m1 
-%ifidn %1,sp - paddd m2, m7 - paddd m3, m7 - psrad m2, 12 - psrad m3, 12 -%else - psrad m2, 6 - psrad m3, 6 -%endif - packssdw m2, m3 - - movu xm1, [r0 + r4] ; m1 = row 7 - punpckhwd xm3, xm6, xm1 - punpcklwd xm6, xm1 - vinserti128 m6, m6, xm3, 1 - pmaddwd m3, m6, [r5 + 1 * mmsize] - pmaddwd m6, [r5] - paddd m4, m3 - - lea r4, [r3 * 3] -%ifidn %1,sp - packuswb m0, m2 - vextracti128 xm2, m0, 1 - movd [r2], xm0 - pextrw [r2 + 4], xm2, 0 - pextrd [r2 + r3], xm0, 1 - pextrw [r2 + r3 + 4], xm2, 2 - pextrd [r2 + r3 * 2], xm0, 2 - pextrw [r2 + r3 * 2 + 4], xm2, 4 - pextrd [r2 + r4], xm0, 3 - pextrw [r2 + r4 + 4], xm2, 6 -%else - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r4], xm2 - vextracti128 xm0, m0, 1 - vextracti128 xm3, m2, 1 - movd [r2 + 8], xm0 - pextrd [r2 + r3 + 8], xm0, 2 - movd [r2 + r3 * 2 + 8], xm3 - pextrd [r2 + r4 + 8], xm3, 2 -%endif - lea r2, [r2 + r3 * 4] - lea r0, [r0 + r1 * 4] - movu xm0, [r0] ; m0 = row 8 - punpckhwd xm2, xm1, xm0 - punpcklwd xm1, xm0 - vinserti128 m1, m1, xm2, 1 - pmaddwd m2, m1, [r5 + 1 * mmsize] - pmaddwd m1, [r5] - paddd m5, m2 -%ifidn %1,sp - paddd m4, m7 - paddd m5, m7 - psrad m4, 12 - psrad m5, 12 -%else - psrad m4, 6 - psrad m5, 6 -%endif - packssdw m4, m5 - - movu xm2, [r0 + r1] ; m2 = row 9 - punpckhwd xm5, xm0, xm2 - punpcklwd xm0, xm2 - vinserti128 m0, m0, xm5, 1 - pmaddwd m0, [r5 + 1 * mmsize] - paddd m6, m0 - movu xm5, [r0 + r1 * 2] ; m5 = row 10 - punpckhwd xm0, xm2, xm5 - punpcklwd xm2, xm5 - vinserti128 m2, m2, xm0, 1 - pmaddwd m2, [r5 + 1 * mmsize] - paddd m1, m2 - -%ifidn %1,sp - paddd m6, m7 - paddd m1, m7 - psrad m6, 12 - psrad m1, 12 -%else - psrad m6, 6 - psrad m1, 6 -%endif - packssdw m6, m1 -%ifidn %1,sp - packuswb m4, m6 - vextracti128 xm6, m4, 1 - movd [r2], xm4 - pextrw [r2 + 4], xm6, 0 - pextrd [r2 + r3], xm4, 1 - pextrw [r2 + r3 + 4], xm6, 2 - pextrd [r2 + r3 * 2], xm4, 2 - pextrw [r2 + r3 * 2 + 4], xm6, 4 - pextrd [r2 + r4], xm4, 3 - pextrw [r2 + r4 + 4], xm6, 6 
-%else - movq [r2], xm4 - movhps [r2 + r3], xm4 - movq [r2 + r3 * 2], xm6 - movhps [r2 + r4], xm6 - vextracti128 xm5, m4, 1 - vextracti128 xm1, m6, 1 - movd [r2 + 8], xm5 - pextrd [r2 + r3 + 8], xm5, 2 - movd [r2 + r3 * 2 + 8], xm1 - pextrd [r2 + r4 + 8], xm1, 2 -%endif - RET -%endmacro - - FILTER_VER_CHROMA_S_AVX2_6x8 sp - FILTER_VER_CHROMA_S_AVX2_6x8 ss - -%macro FILTER_VER_CHROMA_S_AVX2_6x16 1 -%if ARCH_X86_64 == 1 -INIT_YMM avx2 -cglobal interp_4tap_vert_%1_6x16, 4, 7, 9 - mov r4d, r4m - shl r4d, 6 - add r1d, r1d - -%ifdef PIC - lea r5, [pw_ChromaCoeffV] - add r5, r4 -%else - lea r5, [pw_ChromaCoeffV + r4] -%endif - - lea r4, [r1 * 3] - sub r0, r1 -%ifidn %1,sp - mova m8, [v4_pd_526336] -%else - add r3d, r3d -%endif - lea r6, [r3 * 3] - movu xm0, [r0] ; m0 = row 0 - movu xm1, [r0 + r1] ; m1 = row 1 - punpckhwd xm2, xm0, xm1 - punpcklwd xm0, xm1 - vinserti128 m0, m0, xm2, 1 - pmaddwd m0, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 2 - punpckhwd xm3, xm1, xm2 - punpcklwd xm1, xm2 - vinserti128 m1, m1, xm3, 1 - pmaddwd m1, [r5] - movu xm3, [r0 + r4] ; m3 = row 3 - punpckhwd xm4, xm2, xm3 - punpcklwd xm2, xm3 - vinserti128 m2, m2, xm4, 1 - pmaddwd m4, m2, [r5 + 1 * mmsize] - paddd m0, m4 - pmaddwd m2, [r5] - lea r0, [r0 + r1 * 4] - movu xm4, [r0] ; m4 = row 4 - punpckhwd xm5, xm3, xm4 - punpcklwd xm3, xm4 - vinserti128 m3, m3, xm5, 1 - pmaddwd m5, m3, [r5 + 1 * mmsize] - paddd m1, m5 - pmaddwd m3, [r5] -%ifidn %1,sp - paddd m0, m8 - paddd m1, m8 - psrad m0, 12 - psrad m1, 12 -%else - psrad m0, 6 - psrad m1, 6 -%endif - packssdw m0, m1 - - movu xm5, [r0 + r1] ; m5 = row 5 - punpckhwd xm6, xm4, xm5 - punpcklwd xm4, xm5 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - paddd m2, m6 - pmaddwd m4, [r5] - movu xm6, [r0 + r1 * 2] ; m6 = row 6 - punpckhwd xm1, xm5, xm6 - punpcklwd xm5, xm6 - vinserti128 m5, m5, xm1, 1 - pmaddwd m1, m5, [r5 + 1 * mmsize] - pmaddwd m5, [r5] - paddd m3, m1 -%ifidn %1,sp - paddd m2, m8 - paddd m3, m8 - psrad m2, 12 - psrad m3, 
12 -%else - psrad m2, 6 - psrad m3, 6 -%endif - packssdw m2, m3 -%ifidn %1,sp - packuswb m0, m2 - vextracti128 xm2, m0, 1 - movd [r2], xm0 - pextrw [r2 + 4], xm2, 0 - pextrd [r2 + r3], xm0, 1 - pextrw [r2 + r3 + 4], xm2, 2 - pextrd [r2 + r3 * 2], xm0, 2 - pextrw [r2 + r3 * 2 + 4], xm2, 4 - pextrd [r2 + r6], xm0, 3 - pextrw [r2 + r6 + 4], xm2, 6 -%else - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r6], xm2 - vextracti128 xm0, m0, 1 - vextracti128 xm3, m2, 1 - movd [r2 + 8], xm0 - pextrd [r2 + r3 + 8], xm0, 2 - movd [r2 + r3 * 2 + 8], xm3 - pextrd [r2 + r6 + 8], xm3, 2 -%endif - lea r2, [r2 + r3 * 4] - movu xm1, [r0 + r4] ; m1 = row 7 - punpckhwd xm0, xm6, xm1 - punpcklwd xm6, xm1 - vinserti128 m6, m6, xm0, 1 - pmaddwd m0, m6, [r5 + 1 * mmsize] - pmaddwd m6, [r5] - paddd m4, m0 - lea r0, [r0 + r1 * 4] - movu xm0, [r0] ; m0 = row 8 - punpckhwd xm2, xm1, xm0 - punpcklwd xm1, xm0 - vinserti128 m1, m1, xm2, 1 - pmaddwd m2, m1, [r5 + 1 * mmsize] - pmaddwd m1, [r5] - paddd m5, m2 -%ifidn %1,sp - paddd m4, m8 - paddd m5, m8 - psrad m4, 12 - psrad m5, 12 -%else - psrad m4, 6 - psrad m5, 6 -%endif - packssdw m4, m5 - - movu xm2, [r0 + r1] ; m2 = row 9 - punpckhwd xm5, xm0, xm2 - punpcklwd xm0, xm2 - vinserti128 m0, m0, xm5, 1 - pmaddwd m5, m0, [r5 + 1 * mmsize] - paddd m6, m5 - pmaddwd m0, [r5] - movu xm5, [r0 + r1 * 2] ; m5 = row 10 - punpckhwd xm7, xm2, xm5 - punpcklwd xm2, xm5 - vinserti128 m2, m2, xm7, 1 - pmaddwd m7, m2, [r5 + 1 * mmsize] - paddd m1, m7 - pmaddwd m2, [r5] - -%ifidn %1,sp - paddd m6, m8 - paddd m1, m8 - psrad m6, 12 - psrad m1, 12 -%else - psrad m6, 6 - psrad m1, 6 -%endif - packssdw m6, m1 -%ifidn %1,sp - packuswb m4, m6 - vextracti128 xm6, m4, 1 - movd [r2], xm4 - pextrw [r2 + 4], xm6, 0 - pextrd [r2 + r3], xm4, 1 - pextrw [r2 + r3 + 4], xm6, 2 - pextrd [r2 + r3 * 2], xm4, 2 - pextrw [r2 + r3 * 2 + 4], xm6, 4 - pextrd [r2 + r6], xm4, 3 - pextrw [r2 + r6 + 4], xm6, 6 -%else - movq [r2], xm4 - movhps [r2 + r3], xm4 - 
movq [r2 + r3 * 2], xm6 - movhps [r2 + r6], xm6 - vextracti128 xm4, m4, 1 - vextracti128 xm1, m6, 1 - movd [r2 + 8], xm4 - pextrd [r2 + r3 + 8], xm4, 2 - movd [r2 + r3 * 2 + 8], xm1 - pextrd [r2 + r6 + 8], xm1, 2 -%endif - lea r2, [r2 + r3 * 4] - movu xm7, [r0 + r4] ; m7 = row 11 - punpckhwd xm1, xm5, xm7 - punpcklwd xm5, xm7 - vinserti128 m5, m5, xm1, 1 - pmaddwd m1, m5, [r5 + 1 * mmsize] - paddd m0, m1 - pmaddwd m5, [r5] - lea r0, [r0 + r1 * 4] - movu xm1, [r0] ; m1 = row 12 - punpckhwd xm4, xm7, xm1 - punpcklwd xm7, xm1 - vinserti128 m7, m7, xm4, 1 - pmaddwd m4, m7, [r5 + 1 * mmsize] - paddd m2, m4 - pmaddwd m7, [r5] -%ifidn %1,sp - paddd m0, m8 - paddd m2, m8 - psrad m0, 12 - psrad m2, 12 -%else - psrad m0, 6 - psrad m2, 6 -%endif - packssdw m0, m2 - - movu xm4, [r0 + r1] ; m4 = row 13 - punpckhwd xm2, xm1, xm4 - punpcklwd xm1, xm4 - vinserti128 m1, m1, xm2, 1 - pmaddwd m2, m1, [r5 + 1 * mmsize] - paddd m5, m2 - pmaddwd m1, [r5] - movu xm2, [r0 + r1 * 2] ; m2 = row 14 - punpckhwd xm6, xm4, xm2 - punpcklwd xm4, xm2 - vinserti128 m4, m4, xm6, 1 - pmaddwd m6, m4, [r5 + 1 * mmsize] - paddd m7, m6 - pmaddwd m4, [r5] -%ifidn %1,sp - paddd m5, m8 - paddd m7, m8 - psrad m5, 12 - psrad m7, 12 -%else - psrad m5, 6 - psrad m7, 6 -%endif - packssdw m5, m7 -%ifidn %1,sp - packuswb m0, m5 - vextracti128 xm5, m0, 1 - movd [r2], xm0 - pextrw [r2 + 4], xm5, 0 - pextrd [r2 + r3], xm0, 1 - pextrw [r2 + r3 + 4], xm5, 2 - pextrd [r2 + r3 * 2], xm0, 2 - pextrw [r2 + r3 * 2 + 4], xm5, 4 - pextrd [r2 + r6], xm0, 3 - pextrw [r2 + r6 + 4], xm5, 6 -%else - movq [r2], xm0 - movhps [r2 + r3], xm0 - movq [r2 + r3 * 2], xm5 - movhps [r2 + r6], xm5 - vextracti128 xm0, m0, 1 - vextracti128 xm7, m5, 1 - movd [r2 + 8], xm0 - pextrd [r2 + r3 + 8], xm0, 2 - movd [r2 + r3 * 2 + 8], xm7 - pextrd [r2 + r6 + 8], xm7, 2 -%endif - lea r2, [r2 + r3 * 4] - - movu xm6, [r0 + r4] ; m6 = row 15 - punpckhwd xm5, xm2, xm6 - punpcklwd xm2, xm6 - vinserti128 m2, m2, xm5, 1 - pmaddwd m5, m2, [r5 + 1 * mmsize] - 
paddd m1, m5 - pmaddwd m2, [r5] - lea r0, [r0 + r1 * 4] - movu xm0, [r0] ; m0 = row 16 - punpckhwd xm5, xm6, xm0 - punpcklwd xm6, xm0 - vinserti128 m6, m6, xm5, 1 - pmaddwd m5, m6, [r5 + 1 * mmsize] - paddd m4, m5 - pmaddwd m6, [r5] -%ifidn %1,sp - paddd m1, m8 - paddd m4, m8 - psrad m1, 12 - psrad m4, 12 -%else - psrad m1, 6 - psrad m4, 6 -%endif - packssdw m1, m4 - - movu xm5, [r0 + r1] ; m5 = row 17 - punpckhwd xm4, xm0, xm5 - punpcklwd xm0, xm5 - vinserti128 m0, m0, xm4, 1 - pmaddwd m0, [r5 + 1 * mmsize] - paddd m2, m0 - movu xm4, [r0 + r1 * 2] ; m4 = row 18 - punpckhwd xm0, xm5, xm4 - punpcklwd xm5, xm4 - vinserti128 m5, m5, xm0, 1 - pmaddwd m5, [r5 + 1 * mmsize] - paddd m6, m5 -%ifidn %1,sp - paddd m2, m8 - paddd m6, m8 - psrad m2, 12 - psrad m6, 12 -%else - psrad m2, 6 - psrad m6, 6 -%endif - packssdw m2, m6 -%ifidn %1,sp - packuswb m1, m2 - vextracti128 xm2, m1, 1 - movd [r2], xm1 - pextrw [r2 + 4], xm2, 0 - pextrd [r2 + r3], xm1, 1 - pextrw [r2 + r3 + 4], xm2, 2 - pextrd [r2 + r3 * 2], xm1, 2 - pextrw [r2 + r3 * 2 + 4], xm2, 4 - pextrd [r2 + r6], xm1, 3 - pextrw [r2 + r6 + 4], xm2, 6 -%else - movq [r2], xm1 - movhps [r2 + r3], xm1 - movq [r2 + r3 * 2], xm2 - movhps [r2 + r6], xm2 - vextracti128 xm4, m1, 1 - vextracti128 xm6, m2, 1 - movd [r2 + 8], xm4 - pextrd [r2 + r3 + 8], xm4, 2 - movd [r2 + r3 * 2 + 8], xm6 - pextrd [r2 + r6 + 8], xm6, 2 -%endif - RET -%endif -%endmacro - - FILTER_VER_CHROMA_S_AVX2_6x16 sp - FILTER_VER_CHROMA_S_AVX2_6x16 ss - -;--------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vertical_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_SS_W2_4R 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ss_%1x%2, 5, 6, 5 - - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 5 - 
-%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - mov r4d, (%2/4) - -.loopH: - PROCESS_CHROMA_SP_W2_4R r5 - - psrad m0, 6 - psrad m2, 6 - - packssdw m0, m2 - - movd [r2], m0 - pextrd [r2 + r3], m0, 1 - lea r2, [r2 + 2 * r3] - pextrd [r2], m0, 2 - pextrd [r2 + r3], m0, 3 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_VER_CHROMA_SS_W2_4R 2, 4 - FILTER_VER_CHROMA_SS_W2_4R 2, 8 - - FILTER_VER_CHROMA_SS_W2_4R 2, 16 - -;--------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ss_4x2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;--------------------------------------------------------------------------------------------------------------- -INIT_XMM sse2 -cglobal interp_4tap_vert_ss_4x2, 5, 6, 4 - - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - movq m0, [r0] - movq m1, [r0 + r1] - punpcklwd m0, m1 ;m0=[0 1] - pmaddwd m0, [r5 + 0 *16] ;m0=[0+1] Row1 - - lea r0, [r0 + 2 * r1] - movq m2, [r0] - punpcklwd m1, m2 ;m1=[1 2] - pmaddwd m1, [r5 + 0 *16] ;m1=[1+2] Row2 - - movq m3, [r0 + r1] - punpcklwd m2, m3 ;m4=[2 3] - pmaddwd m2, [r5 + 1 * 16] - paddd m0, m2 ;m0=[0+1+2+3] Row1 done - psrad m0, 6 - - movq m2, [r0 + 2 * r1] - punpcklwd m3, m2 ;m5=[3 4] - pmaddwd m3, [r5 + 1 * 16] - paddd m1, m3 ;m1=[1+2+3+4] Row2 done - psrad m1, 6 - - packssdw m0, m1 - - movlps [r2], m0 - movhps [r2 + r3], m0 - - RET - -;------------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vertical_ss_6x8(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) 
-;------------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_SS_W6_H4 2 -INIT_XMM sse4 -cglobal interp_4tap_vert_ss_6x%2, 5, 7, 6 - - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r6, [r5 + r4] -%else - lea r6, [tab_ChromaCoeffV + r4] -%endif - - mov r4d, %2/4 - -.loopH: - PROCESS_CHROMA_SP_W4_4R - - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 - - movlps [r2], m0 - movhps [r2 + r3], m0 - lea r5, [r2 + 2 * r3] - movlps [r5], m2 - movhps [r5 + r3], m2 - - lea r5, [4 * r1 - 2 * 4] - sub r0, r5 - add r2, 2 * 4 - - PROCESS_CHROMA_SP_W2_4R r6 - - psrad m0, 6 - psrad m2, 6 - - packssdw m0, m2 - - movd [r2], m0 - pextrd [r2 + r3], m0, 1 - lea r2, [r2 + 2 * r3] - pextrd [r2], m0, 2 - pextrd [r2 + r3], m0, 3 - - sub r0, 2 * 4 - lea r2, [r2 + 2 * r3 - 2 * 4] - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_VER_CHROMA_SS_W6_H4 6, 8 - - FILTER_VER_CHROMA_SS_W6_H4 6, 16 - - -;---------------------------------------------------------------------------------------------------------------- -; void interp_4tap_vert_ss_8x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) -;---------------------------------------------------------------------------------------------------------------- -%macro FILTER_VER_CHROMA_SS_W8_H2 2 -INIT_XMM sse2 -cglobal interp_4tap_vert_ss_%1x%2, 5, 6, 7 - - add r1d, r1d - add r3d, r3d - sub r0, r1 - shl r4d, 5 - -%ifdef PIC - lea r5, [tab_ChromaCoeffV] - lea r5, [r5 + r4] -%else - lea r5, [tab_ChromaCoeffV + r4] -%endif - - mov r4d, %2/2 -.loopH: - PROCESS_CHROMA_SP_W8_2R - - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - - packssdw m0, m1 - packssdw m2, m3 - - movu [r2], m0 - movu [r2 + r3], m2 - - lea r2, [r2 + 2 * r3] - - dec r4d - jnz .loopH - - RET -%endmacro - - FILTER_VER_CHROMA_SS_W8_H2 8, 2 - 
FILTER_VER_CHROMA_SS_W8_H2 8, 4 - FILTER_VER_CHROMA_SS_W8_H2 8, 6 - FILTER_VER_CHROMA_SS_W8_H2 8, 8 - FILTER_VER_CHROMA_SS_W8_H2 8, 16 - FILTER_VER_CHROMA_SS_W8_H2 8, 32 - - FILTER_VER_CHROMA_SS_W8_H2 8, 12 - FILTER_VER_CHROMA_SS_W8_H2 8, 64 -
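For readers following the removed kernels above, here is a minimal scalar sketch of what the `interp_4tap_vert_ss_*` and `interp_4tap_vert_sp_*` routines appear to compute. It is an illustrative model, not x265 code. The tap table is assumed to be the standard HEVC chroma interpolation coefficients (the contents of `tab_ChromaCoeffV` / `pw_ChromaCoeffV` are not shown in this diff), and the function names and the list-of-rows layout are hypothetical. The shifts (6 and 12), the rounding constant 526336, and the one-row-up source offset (`sub r0, r1`) are taken from the assembly.

```python
# Scalar model of the 4-tap vertical chroma filter (illustrative, not x265 code).
# Assumption: standard HEVC chroma taps, selected by coeffIdx (0..7).
CHROMA_TAPS = [
    (0, 64, 0, 0),
    (-2, 58, 10, -2),
    (-4, 54, 16, -2),
    (-6, 46, 28, -4),
    (-4, 36, 36, -4),
    (-4, 28, 46, -6),
    (-2, 16, 54, -4),
    (-2, 10, 58, -2),
]

SP_OFFSET = 526336  # value of v4_pd_526336 in the assembly: (1 << 11) + (8192 << 6)


def _clamp(value, low, high):
    return max(low, min(high, value))


def _filter_column(src, x, y, taps):
    # src[0] is the row above the block, mirroring `sub r0, r1` in the assembly.
    return sum(taps[k] * src[y + k][x] for k in range(4))


def interp_4tap_vert_ss(src, width, height, coeff_idx):
    """int16 in, int16 out: shift right by 6, saturate like packssdw."""
    taps = CHROMA_TAPS[coeff_idx]
    return [
        [_clamp(_filter_column(src, x, y, taps) >> 6, -32768, 32767)
         for x in range(width)]
        for y in range(height)
    ]


def interp_4tap_vert_sp(src, width, height, coeff_idx):
    """int16 in, 8-bit pixel out: add rounding offset, shift by 12, clamp to 0..255."""
    taps = CHROMA_TAPS[coeff_idx]
    return [
        [_clamp((_filter_column(src, x, y, taps) + SP_OFFSET) >> 12, 0, 255)
         for x in range(width)]
        for y in range(height)
    ]
```

Both functions expect `height + 3` source rows: one above the block and two below it. Python's `>>` on negative integers is an arithmetic shift, which matches `psrad`.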