x265: Changes of Revision 10
x265.changes
@@ -1,4 +1,45 @@
 -------------------------------------------------------------------
+Fri May 29 09:11:02 UTC 2015 - aloisio@gmx.com
+
+- soname bump to 59
+- Update to version 1.7
+  * large amount of assembly code optimizations
+  * some preliminary support for high dynamic range content
+  * improvements for multi-library support
+  * some new quality features
+    (full documentation at: http://x265.readthedocs.org/en/1.7)
+  * This release simplifies the multi-library support introduced
+    in version 1.6. Any libx265 can now forward API requests to
+    other installed libx265 libraries (by name) so applications
+    like ffmpeg and the x265 CLI can select between 8bit and 10bit
+    encodes at runtime without the need of a shim library or
+    library load path hacks. See --output-depth, and
+    http://x265.readthedocs.org/en/1.7/api.html#multi-library-interface
+  * For quality, x265 now allows you to configure the quantization
+    group size smaller than the CTU size (for finer grained AQ
+    adjustments). See --qg-size.
+  * x265 now supports limited mid-encode reconfigure via a new public
+    method: x265_encoder_reconfig()
+  * For HDR, x265 now supports signaling the SMPTE 2084 color transfer
+    function, the SMPTE 2086 mastering display color primaries, and the
+    content light levels. See --master-display, --max-cll
+  * x265 will no longer emit any non-conformant bitstreams unless
+    --allow-non-conformance is specified.
+  * The x265 CLI now supports a simple encode preview feature. See
+    --recon-y4m-exec.
+  * The AnnexB NAL headers can now be configured off, via x265_param.bAnnexB
+    This is not configurable via the CLI because it is a function of the
+    muxer being used, and the CLI only supports raw output files. See
+    --annexb
+  Misc:
+  * --lossless encodes are now signaled as level 8.5
+  * --profile now has a -P short option
+  * The regression scripts used by x265 are now public, and can be found at:
+    https://bitbucket.org/sborho/test-harness
+  * x265's cmake scripts now support PGO builds, the test-harness can be
+    used to drive the profile-guided build process.
+
+-------------------------------------------------------------------
 Tue Apr 28 20:08:06 UTC 2015 - aloisio@gmx.com
 
 - soname bumped to 51
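For packagers and users, the headline multi-library change means the bit depth can now be chosen per run. Assuming a multilib install where libx265_main10 is present alongside the 8bpp library (a packaging detail, not guaranteed by this update), a sketch of such an invocation would be:

    x265 --input source.y4m --output-depth 10 --crf 20 -o out.hevc

Here the 8bpp-linked x265 binary binds libx265_main10 at runtime to emit a Main10 bitstream.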
x265.spec
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 
 Name: x265
-%define soname 51
+%define soname 59
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version: 1.6
+Version: 1.7
 Release: 0
 License: GPL-2.0+
 Summary: A free h265/HEVC encoder - encoder binary
baselibs.conf
@@ -1,1 +1,1 @@
-libx265-51
+libx265-59
x265_1.6.tar.gz/.hg_archival.txt -> x265_1.7.tar.gz/.hg_archival.txt
@@ -1,4 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: cbeb7d8a4880e4020c4545dd8e498432c3c6cad3
+node: 8425278def1edf0931dc33fc518e1950063e76b0
 branch: stable
-tag: 1.6
+tag: 1.7
x265_1.6.tar.gz/.hgtags -> x265_1.7.tar.gz/.hgtags
@@ -14,3 +14,4 @@
 c1e4fc0162c14fdb84f5c3bd404fb28cfe10a17f 1.3
 5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
 9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
+cbeb7d8a4880e4020c4545dd8e498432c3c6cad3 1.6
x265_1.6.tar.gz/doc/reST/api.rst -> x265_1.7.tar.gz/doc/reST/api.rst
@@ -171,8 +171,26 @@
      * how x265_encoder_open has changed the parameters.
      * note that the data accessible through pointers in the returned param struct
      * (e.g. filenames) should not be modified by the calling application. */
-    void x265_encoder_parameters(x265_encoder *, x265_param *);
-
+    void x265_encoder_parameters(x265_encoder *, x265_param *);
+
+**x265_encoder_reconfig()** may be used to reconfigure encoder parameters mid-encode::
+
+    /* x265_encoder_reconfig:
+     *      used to modify encoder parameters.
+     *      various parameters from x265_param are copied.
+     *      this takes effect immediately, on whichever frame is encoded next;
+     *      returns 0 on success, negative on parameter validation error.
+     *
+     *      not all parameters can be changed; see the actual function for a
+     *      detailed breakdown. since not all parameters can be changed, moving
+     *      from preset to preset may not always fully copy all relevant parameters,
+     *      but should still work usably in practice. however, more so than for
+     *      other presets, many of the speed shortcuts used in ultrafast cannot be
+     *      switched out of; using reconfig to switch between ultrafast and other
+     *      presets is not recommended without a more fine-grained breakdown of
+     *      parameters to take this into account. */
+    int x265_encoder_reconfig(x265_encoder *, x265_param *);
+
 Pictures
 ========
 
@@ -352,7 +370,7 @@
 Multi-library Interface
 =======================
 
-If your application might want to make a runtime selection between among
+If your application might want to make a runtime selection between
 a number of libx265 libraries (perhaps 8bpp and 16bpp), then you will
 want to use the multi-library interface.
 
@@ -370,13 +388,34 @@
      * libx265 */
     const x265_api* x265_api_get(int bitDepth);
 
-The general idea is to request the API for the bitDepth you would prefer
-the encoder to use (8 or 10), and if that returns NULL you request the
-API for bitDepth=0, which returns the system default libx265.
-
 Note that using this multi-library API in your application is only the
-first step. Next your application must dynamically link to libx265 and
-then you must build and install a multi-lib configuration of libx265,
-which includes 8bpp and 16bpp builds of libx265 and a shim library which
-forwards x265_api_get() calls to the appropriate library using dynamic
-loading and binding.
+first step.
+
+Your application must link to one build of libx265 (statically or
+dynamically) and this linked version of libx265 will support one
+bit-depth (8 or 10 bits).
+
+Your application must now request the API for the bitDepth you would
+prefer the encoder to use (8 or 10). If the requested bitdepth is zero,
+or if it matches the bitdepth of the system default libx265 (the
+currently linked library), then this library will be used for encode.
+If you request a different bit-depth, the linked libx265 will attempt
+to dynamically bind a shared library with a name appropriate for the
+requested bit-depth:
+
+    8-bit:  libx265_main.dll
+    10-bit: libx265_main10.dll
+
+    (the shared library extension is obviously platform specific. On
+    Linux it is .so while on Mac it is .dylib)
+
+For example on Windows, one could package together an x265.exe
+statically linked against the 8bpp libx265 together with a
+libx265_main10.dll in the same folder, and this executable would be able
+to encode main and main10 bitstreams.
+
+On Linux, x265 packagers could install 8bpp static and shared libraries
+under the name libx265 (so all applications link against 8bpp libx265)
+and then also install libx265_main10.so (symlinked to its numbered solib).
+Thus applications which use x265_api_get() will be able to generate main
+or main10 bitstreams.
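To make the two API surfaces documented above concrete, here is a minimal C sketch (not part of the patch) that requests a 10-bit API through x265_api_get() and later calls encoder_reconfig(); the preset name, resolution, and searchRange value are arbitrary illustration:

    #include <x265.h>
    #include <stddef.h>

    int open_and_retune(void)
    {
        /* may dynamically bind libx265_main10 if the linked build is 8bpp */
        const x265_api *api = x265_api_get(10);
        if (!api)
            api = x265_api_get(0);   /* fall back to the linked library */
        if (!api)
            return -1;

        x265_param *param = api->param_alloc();
        api->param_default_preset(param, "medium", NULL);
        param->sourceWidth = 1920;
        param->sourceHeight = 1080;
        param->fpsNum = 25;
        param->fpsDenom = 1;

        x265_encoder *enc = api->encoder_open(param);
        if (!enc)
            return -1;

        /* after encoding some frames, tighten motion search mid-encode */
        param->searchRange = 44;
        if (api->encoder_reconfig(enc, param) < 0)
        {
            /* validation failed; the encoder keeps its previous settings */
        }

        api->encoder_close(enc);
        api->param_free(param);
        return 0;
    }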
x265_1.6.tar.gz/doc/reST/cli.rst -> x265_1.7.tar.gz/doc/reST/cli.rst
@@ -159,6 +159,13 @@
     handled implicitly.
 
     One may also directly supply the CPU capability bitmap as an integer.
+
+    Note that by specifying this option you are overriding x265's CPU
+    detection and it is possible to do this wrong. You can cause encoder
+    crashes by specifying SIMD architectures which are not supported on
+    your CPU.
+
+    Default: auto-detected SIMD architectures
 
 .. option:: --frame-threads, -F <integer>
 
@@ -171,7 +178,7 @@
     Over-allocation of frame threads will not improve performance, it
     will generally just increase memory use.
 
-    **Values:** any value between 8 and 16. Default is 0, auto-detect
+    **Values:** any value between 0 and 16. Default is 0, auto-detect
 
 .. option:: --pools <string>, --numa-pools <string>
 
@@ -201,11 +208,11 @@
     their node, they will not be allowed to migrate between nodes, but they
     will be allowed to move between CPU cores within their node.
 
-    If the three pool features: :option:`--wpp` :option:`--pmode` and
-    :option:`--pme` are all disabled, then :option:`--pools` is ignored
-    and no thread pools are created.
+    If the four pool features: :option:`--wpp`, :option:`--pmode`,
+    :option:`--pme` and :option:`--lookahead-slices` are all disabled,
+    then :option:`--pools` is ignored and no thread pools are created.
 
-    If "none" is specified, then all three of the thread pool features are
+    If "none" is specified, then all four of the thread pool features are
     implicitly disabled.
 
     Multiple thread pools will be allocated for any NUMA node with more than
@@ -217,9 +224,22 @@
     :option:`--frame-threads`. The pools are used for WPP and for
     distributed analysis and motion search.
 
+    On Windows, the native APIs offer sufficient functionality to
+    discover the NUMA topology and enforce the thread affinity that
+    libx265 needs (so long as you have not chosen to target XP or
+    Vista), but on POSIX systems it relies on libnuma for this
+    functionality. If your target POSIX system is single socket, then
+    building without libnuma is a perfectly reasonable option, as it
+    will have no effect on the runtime behavior. On a multiple-socket
+    system, a POSIX build of libx265 without libnuma will be less work
+    efficient. See :ref:`thread pools <pools>` for more detail.
+
     Default "", one thread is allocated per detected hardware thread
     (logical CPU cores) and one thread pool per NUMA node.
 
+    Note that the string value will need to be escaped or quoted to
+    protect against shell expansion on many platforms
+
 .. option:: --wpp, --no-wpp
 
     Enable Wavefront Parallel Processing. The encoder may begin encoding
@@ -399,10 +419,20 @@
 
     **CLI ONLY**
 
+.. option:: --output-depth, -D 8|10
+
+    Bitdepth of output HEVC bitstream, which is also the internal bit
+    depth of the encoder. If the requested bit depth is not the bit
+    depth of the linked libx265, it will attempt to bind libx265_main
+    for an 8bit encoder, or libx265_main10 for a 10bit encoder, with the
+    same API version as the linked libx265.
+
+    **CLI ONLY**
+
 Profile, Level, Tier
 ====================
 
-.. option:: --profile <string>
+.. option:: --profile, -P <string>
 
     Enforce the requirements of the specified profile, ensuring the
     output stream will be decodable by a decoder which supports that
@@ -437,7 +467,7 @@
     times 10, for example level **5.1** is specified as "5.1" or "51",
     and level **5.0** is specified as "5.0" or "50".
 
-    Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2
+    Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2, 8.5
 
 .. option:: --high-tier, --no-high-tier
 
@@ -464,11 +494,22 @@
     HEVC specification. If x265 detects that the total reference count
     is greater than 8, it will issue a warning that the resulting stream
     is non-compliant and it signals the stream as profile NONE and level
-    NONE but still allows the encode to continue. Compliant HEVC
+    NONE and will abort the encode unless
+    :option:`--allow-non-conformance` it specified. Compliant HEVC
     decoders may refuse to decode such streams.
 
     Default 3
 
+.. option:: --allow-non-conformance, --no-allow-non-conformance
+
+    Allow libx265 to generate a bitstream with profile and level NONE.
+    By default it will abort any encode which does not meet strict level
+    compliance. The two most likely causes for non-conformance are
+    :option:`--ctu` being too small, :option:`--ref` being too high,
+    or the bitrate or resolution being out of specification.
+
+    Default: disabled
+
 .. note::
     :option:`--profile`, :option:`--level-idc`, and
     :option:`--high-tier` are only intended for use when you are
@@ -476,7 +517,7 @@
     limitations and must constrain the bitstream within those limits.
     Specifying a profile or level may lower the encode quality
     parameters to meet those requirements but it will never raise
-    them.
+    them. It may enable VBV constraints on a CRF encode.
 
 Mode decision / Analysis
 ========================
 
@@ -1111,6 +1152,14 @@
 
     **Range of values:** 0.0 to 3.0
 
+.. option:: --qg-size <64|32|16>
+
+    Enable adaptive quantization for sub-CTUs. This parameter specifies
+    the minimum CU size at which QP can be adjusted, ie. Quantization Group
+    size. Allowed range of values are 64, 32, 16 provided this falls within
+    the inclusive range [maxCUSize, minCUSize]. Experimental.
+    Default: same as maxCUSize
+
 .. option:: --cutree, --no-cutree
 
     Enable the use of lookahead's lowres motion vector fields to
 
@@ -1162,12 +1211,12 @@
 .. option:: --strict-cbr, --no-strict-cbr
 
     Enables stricter conditions to control bitrate deviance from the
-    target bitrate in CBR mode. Bitrate adherence is prioritised
+    target bitrate in ABR mode. Bit rate adherence is prioritised
     over quality. Rate tolerance is reduced to 50%. Default disabled.
 
     This option is for use-cases which require the final average bitrate
-    to be within very strict limits of the target - preventing overshoots
-    completely, and achieve bitrates within 5% of target bitrate,
+    to be within very strict limits of the target; preventing overshoots,
+    while keeping the bit rate within 5% of the target setting,
    especially in short segment encodes. Typically, the encoder stays
     conservative, waiting until there is enough feedback in terms of
     encoded frames to control QP. strict-cbr allows the encoder to be
 
@@ -1209,7 +1258,7 @@
     lookahead). Default value is 0.6. Increasing it to 1 will
     effectively generate CQP
 
-.. option:: --qstep <integer>
+.. option:: --qpstep <integer>
 
     The maximum single adjustment in QP allowed to rate control. Default
     4
 
@@ -1451,9 +1500,48 @@
     specification for a description of these values. Default undefined
     (not signaled)
 
+.. option:: --master-display <string>
+
+    SMPTE ST 2086 mastering display color volume SEI info, specified as
+    a string which is parsed when the stream header SEI are emitted. The
+    string format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)"
+    where %hu are unsigned 16bit integers and %u are unsigned 32bit
+    integers. The SEI includes X,Y display primaries for RGB channels,
+    white point X,Y and max,min luminance values. (HDR)
+
+    Example for P65D3 1000-nits:
+
+        G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)
+
+    Note that this string value will need to be escaped or quoted to
+    protect against shell expansion on many platforms. No default.
+
+.. option:: --max-cll <string>
+
+    Maximum content light level and maximum frame average light level as
+    required by the Consumer Electronics Association 861.3 specification.
+
+    Specified as a string which is parsed when the stream header SEI are
+    emitted. The string format is "%hu,%hu" where %hu are unsigned 16bit
+    integers. The first value is the max content light level (or 0 if no
+    maximum is indicated), the second value is the maximum picture
+    average light level (or 0). (HDR)
+
+    Note that this string value will need to be escaped or quoted to
+    protect against shell expansion on many platforms. No default.
+
 Bitstream options
 =================
 
+.. option:: --annexb, --no-annexb
+
+    If enabled, x265 will produce Annex B bitstream format, which places
+    start codes before NAL. If disabled, x265 will produce file format,
+    which places length before NAL. x265 CLI will choose the right option
+    based on output format. Default enabled
+
+    **API ONLY**
+
 .. option:: --repeat-headers, --no-repeat-headers
 
     If enabled, x265 will emit VPS, SPS, and PPS headers with every
 
@@ -1498,8 +1586,8 @@
 
     Enable a temporal sub layer. All referenced I/P/B frames are in the
     base layer and all unreferenced B frames are placed in a temporal
-    sublayer. A decoder may chose to drop the sublayer and only decode
-    and display the base layer slices.
+    enhancement layer. A decoder may chose to drop the enhancement layer
+    and only decode and display the base layer slices.
 
     If used with a fixed GOP (:option:`b-adapt` 0) and :option:`bframes`
     3 then the two layers evenly split the frame rate, with a cadence of
 
@@ -1525,4 +1613,20 @@
 
     **CLI ONLY**
 
+.. option:: --recon-y4m-exec <string>
+
+    If you have an application which can play a Y4MPEG stream received
+    on stdin, the x265 CLI can feed it reconstructed pictures in display
+    order. The pictures will have no timing info, obviously, so the
+    picture timing will be determined primarily by encoding elapsed time
+    and latencies, but it can be useful to preview the pictures being
+    output by the encoder to validate input settings and rate control
+    parameters.
+
+    Example command for ffplay (assuming it is in your PATH):
+
+        --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
+
+    **CLI ONLY**
+
 .. vim: noet
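Pulling several of the new options together: a hypothetical HDR-flavoured invocation (file names and the max-cll values are placeholders; the SEI strings are quoted as the documentation advises):

    x265 --input source.y4m --output-depth 10 --crf 18 \
         --master-display "G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)" \
         --max-cll "1000,400" --recon-y4m-exec "ffplay -i pipe:0 -autoexit" \
         -o out.hevc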
x265_1.6.tar.gz/doc/reST/threading.rst -> x265_1.7.tar.gz/doc/reST/threading.rst
@@ -2,6 +2,8 @@
 Threading
 *********
 
+.. _pools:
+
 Thread Pools
 ============
 
@@ -31,6 +33,18 @@
 expected to drop that job so the worker thread may go back to the
 pool and find more work.
 
+On Windows, the native APIs offer sufficient functionality to discover
+the NUMA topology and enforce the thread affinity that libx265 needs (so
+long as you have not chosen to target XP or Vista), but on POSIX systems
+it relies on libnuma for this functionality. If your target POSIX system
+is single socket, then building without libnuma is a perfectly
+reasonable option, as it will have no effect on the runtime behavior. On
+a multiple-socket system, a POSIX build of libx265 without libnuma will
+be less work efficient, but will still function correctly. You lose the
+work isolation effect that keeps each frame encoder from only using the
+threads of a single socket and so you incur a heavier context switching
+cost.
+
 Wavefront Parallel Processing
 =============================
 
@@ -225,6 +239,7 @@
 lowres cost analysis to worker threads. It will use bonded task groups
 to perform batches of frame cost estimates, and it may optionally use
 bonded task groups to measure single frame cost estimates using slices.
+(see :option:`--lookahead-slices`)
 
 The function slicetypeDecide() itself is also be performed by a worker
 thread if your encoder has a thread pool, else it runs within the
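To illustrate the per-node pool allocation described above with the --pools syntax from cli.rst, two hypothetical invocations for a dual-socket machine (the core counts and file names are assumptions):

    x265 --pools "8,8" --input in.y4m -o out.hevc    # one 8-thread pool per NUMA node
    x265 --pools none --input in.y4m -o out.hevc     # no pools; wpp/pmode/pme/lookahead-slices implicitly disabled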
x265_1.6.tar.gz/readme.rst -> x265_1.7.tar.gz/readme.rst
@@ -3,7 +3,7 @@
 =================
 
 | **Read:** | Online `documentation <http://x265.readthedocs.org/en/default/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265/wiki/>`_
-| **Download:** | `releases <http://bitbucket.org/multicoreware/x265/downloads/>`_
+| **Download:** | `releases <http://ftp.videolan.org/pub/videolan/x265/>`_
 | **Interact:** | #x265 on freenode.irc.net | `x265-devel@videolan.org <http://mailman.videolan.org/listinfo/x265-devel>`_ | `Report an issue <https://bitbucket.org/multicoreware/x265/issues?status=new&status=open>`_
 
 `x265 <https://www.videolan.org/developers/x265.html>`_ is an open
x265_1.6.tar.gz/source/CMakeLists.txt -> x265_1.7.tar.gz/source/CMakeLists.txt
@@ -30,7 +30,7 @@
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 51)
+set(X265_BUILD 59)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -65,15 +65,19 @@
     if(LIBRT)
         list(APPEND PLATFORM_LIBS rt)
     endif()
+    find_library(LIBDL dl)
+    if(LIBDL)
+        list(APPEND PLATFORM_LIBS dl)
+    endif()
     find_package(Numa)
     if(NUMA_FOUND)
-        list(APPEND CMAKE_REQUIRED_LIBRARIES ${NUMA_LIBRARY})
+        link_directories(${NUMA_LIBRARY_DIR})
+        list(APPEND CMAKE_REQUIRED_LIBRARIES numa)
         check_symbol_exists(numa_node_of_cpu numa.h NUMA_V2)
         if(NUMA_V2)
             add_definitions(-DHAVE_LIBNUMA)
             message(STATUS "libnuma found, building with support for NUMA nodes")
-            list(APPEND PLATFORM_LIBS ${NUMA_LIBRARY})
-            link_directories(${NUMA_LIBRARY_DIR})
+            list(APPEND PLATFORM_LIBS numa)
             include_directories(${NUMA_INCLUDE_DIR})
         endif()
     endif()
@@ -90,7 +94,7 @@
 if(CMAKE_GENERATOR STREQUAL "Xcode")
   set(XCODE 1)
 endif()
-if (APPLE)
+if(APPLE)
   add_definitions(-DMACOS)
 endif()
 
@@ -196,6 +200,7 @@
         add_definitions(-static)
         list(APPEND LINKER_OPTIONS "-static")
     endif(STATIC_LINK_CRT)
+    check_cxx_compiler_flag(-Wno-strict-overflow CC_HAS_NO_STRICT_OVERFLOW)
     check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING)
     check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS)
     if (CC_HAS_NO_ARRAY_BOUNDS)
@@ -291,7 +296,7 @@
     endif()
 endif(WARNINGS_AS_ERRORS)
 
-if (WIN32)
+if(WIN32)
     # Visual leak detector
     find_package(VLD QUIET)
     if(VLD_FOUND)
@@ -300,12 +305,15 @@
         list(APPEND PLATFORM_LIBS ${VLD_LIBRARIES})
         link_directories(${VLD_LIBRARY_DIRS})
     endif()
-    option(WINXP_SUPPORT "Make binaries compatible with Windows XP" OFF)
+    option(WINXP_SUPPORT "Make binaries compatible with Windows XP and Vista" OFF)
     if(WINXP_SUPPORT)
         # force use of workarounds for CONDITION_VARIABLE and atomic
         # intrinsics introduced after XP
-        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP)
-    endif()
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP -D_WIN32_WINNT_WIN7=0x0601)
+    else(WINXP_SUPPORT)
+        # default to targeting Windows 7 for the NUMA APIs
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WIN7)
+    endif(WINXP_SUPPORT)
 endif()
 
 include(version) # determine X265_VERSION and X265_LATEST_TAG
 
@@ -462,8 +470,10 @@
 # Main CLI application
 option(ENABLE_CLI "Build standalone CLI application" ON)
 if(ENABLE_CLI)
-    file(GLOB InputFiles input/*.cpp input/*.h)
-    file(GLOB OutputFiles output/*.cpp output/*.h)
+    file(GLOB InputFiles input/input.cpp input/yuv.cpp input/y4m.cpp input/*.h)
+    file(GLOB OutputFiles output/output.cpp output/reconplay.cpp output/*.h
+                          output/yuv.cpp output/y4m.cpp # recon
+                          output/raw.cpp)               # muxers
     file(GLOB FilterFiles filters/*.cpp filters/*.h)
     source_group(input FILES ${InputFiles})
     source_group(output FILES ${OutputFiles})
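The FPROFILE_GENERATE and FPROFILE_USE options visible in the first hunk are the knobs behind the changelog's PGO note. A hypothetical two-pass flow (normally driven by the test-harness; the training clip and exact option semantics are assumptions, see the cmake scripts):

    cmake ../source -DFPROFILE_GENERATE=ON && make                      # instrumented build
    ./x265 --input training.y4m -o /dev/null                            # exercise the encoder
    cmake ../source -DFPROFILE_GENERATE=OFF -DFPROFILE_USE=ON && make   # rebuild using the profile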
x265_1.6.tar.gz/source/common/common.cpp -> x265_1.7.tar.gz/source/common/common.cpp
@@ -100,11 +100,14 @@
     return (x265_exp2_lut[i & 63] + 256) << (i >> 6) >> 8;
 }
 
-void x265_log(const x265_param *param, int level, const char *fmt, ...)
+void general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...)
 {
     if (param && level > param->logLevel)
         return;
-    const char *log_level;
+    const int bufferSize = 4096;
+    char buffer[bufferSize];
+    int p = 0;
+    const char* log_level;
     switch (level)
     {
     case X265_LOG_ERROR:
@@ -127,11 +130,13 @@
         break;
     }
 
-    fprintf(stderr, "x265 [%s]: ", log_level);
+    if (caller)
+        p += sprintf(buffer, "%-4s [%s]: ", caller, log_level);
     va_list arg;
     va_start(arg, fmt);
-    vfprintf(stderr, fmt, arg);
+    vsnprintf(buffer + p, bufferSize - p, fmt, arg);
     va_end(arg);
+    fputs(buffer, stderr);
 }
 
 double x265_ssim2dB(double ssim)
x265_1.6.tar.gz/source/common/common.h -> x265_1.7.tar.gz/source/common/common.h
@@ -413,7 +413,8 @@
 
 /* outside x265 namespace, but prefixed. defined in common.cpp */
 int64_t x265_mdate(void);
-void x265_log(const x265_param *param, int level, const char *fmt, ...);
+#define x265_log(param, ...) general_log(param, "x265", __VA_ARGS__)
+void general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...);
 int x265_exp2fix8(double x);
 double x265_ssim2dB(double ssim);
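The macro keeps every existing x265_log() call site compiling unchanged while tagging output with a caller name; a small usage sketch (the "pool" tag in the second call is hypothetical):

    x265_log(param, X265_LOG_WARNING, "VBV underflow\n");           /* prints "x265 [warning]: ..." */
    general_log(param, "pool", X265_LOG_INFO, "NUMA node %d\n", 0); /* any component can pass its own tag */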
x265_1.6.tar.gz/source/common/constants.cpp -> x265_1.7.tar.gz/source/common/constants.cpp
@@ -324,7 +324,7 @@
     4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31,
     36, 44, 52, 60, 37, 45, 53, 61, 38, 46, 54, 62, 39, 47, 55, 63 }
 };
 
-const uint16_t g_scan4x4[NUM_SCAN_TYPE][4 * 4] =
+ALIGN_VAR_16(const uint16_t, g_scan4x4[NUM_SCAN_TYPE][4 * 4]) =
 {
     { 0, 4, 1, 8, 5, 2, 12, 9, 6, 3, 13, 10, 7, 14, 11, 15 },
     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
x265_1.6.tar.gz/source/common/contexts.h -> x265_1.7.tar.gz/source/common/contexts.h
@@ -106,6 +106,7 @@
 
 // private namespace
 extern const uint32_t g_entropyBits[128];
+extern const uint32_t g_entropyStateBits[128];
 extern const uint8_t g_nextState[128][2];
 
 #define sbacGetMps(S) ((S) & 1)
x265_1.6.tar.gz/source/common/cudata.cpp -> x265_1.7.tar.gz/source/common/cudata.cpp
@@ -298,7 +298,7 @@
 }
 
 // initialize Sub partition
-void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom)
+void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp)
 {
     m_absIdxInCTU = cuGeom.absPartIdx;
     m_encData = ctu.m_encData;
@@ -312,8 +312,8 @@
     m_cuAboveRight = ctu.m_cuAboveRight;
     X265_CHECK(m_numPartitions == cuGeom.numPartitions, "initSubCU() size mismatch\n");
 
-    /* sequential memsets */
-    m_partSet((uint8_t*)m_qp, (uint8_t)ctu.m_qp[0]);
+    m_partSet((uint8_t*)m_qp, (uint8_t)qp);
+
     m_partSet(m_log2CUSize, (uint8_t)cuGeom.log2CUSize);
     m_partSet(m_lumaIntraDir, (uint8_t)DC_IDX);
     m_partSet(m_tqBypass, (uint8_t)m_encData->m_param->bLossless);
@@ -1830,6 +1830,10 @@
     }
 }
 
+/* Clip motion vector to within slightly padded boundary of picture (the
+ * MV may reference a block that is completely within the padded area).
+ * Note this function is unaware of how much of this picture is actually
+ * available for use (re: frame parallelism) */
 void CUData::clipMv(MV& outMV) const
 {
     const uint32_t mvshift = 2;
@@ -2027,6 +2031,7 @@
     uint32_t blockSize = 1 << log2CUSize;
     uint32_t sbWidth = 1 << (g_log2Size[maxCUSize] - log2CUSize);
     int32_t lastLevelFlag = log2CUSize == g_log2Size[minCUSize];
+
     for (uint32_t sbY = 0; sbY < sbWidth; sbY++)
     {
         for (uint32_t sbX = 0; sbX < sbWidth; sbX++)
x265_1.6.tar.gz/source/common/cudata.h -> x265_1.7.tar.gz/source/common/cudata.h
@@ -85,8 +85,8 @@
     uint32_t childOffset;   // offset of the first child CU from current CU
     uint32_t absPartIdx;    // Part index of this CU in terms of 4x4 blocks.
     uint32_t numPartitions; // Number of 4x4 blocks in the CU
-    uint32_t depth;         // depth of this CU relative from CTU
     uint32_t flags;         // CU flags.
+    uint32_t depth;         // depth of this CU relative from CTU
 };
 
 struct MVField
@@ -182,7 +182,7 @@
     static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, uint32_t minCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]);
 
     void initCTU(const Frame& frame, uint32_t cuAddr, int qp);
-    void initSubCU(const CUData& ctu, const CUGeom& cuGeom);
+    void initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp);
     void initLosslessCU(const CUData& cu, const CUGeom& cuGeom);
 
     void copyPartFrom(const CUData& cu, const CUGeom& childGeom, uint32_t subPartIdx);
x265_1.6.tar.gz/source/common/dct.cpp -> x265_1.7.tar.gz/source/common/dct.cpp
@@ -752,7 +752,7 @@
     }
 }
 
-int findPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig)
+int scanPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* /*scanCG4x4*/, const int /*trSize*/)
 {
     memset(coeffNum, 0, MLS_GRP_NUM * sizeof(*coeffNum));
     memset(coeffFlag, 0, MLS_GRP_NUM * sizeof(*coeffFlag));
@@ -785,6 +785,37 @@
     return scanPosLast - 1;
 }
 
+uint32_t findPosFirstLast_c(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
+{
+    int n;
+
+    for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
+
+    uint32_t lastNZPosInCG = (uint32_t)n;
+
+    for (n = 0;; n++)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    uint32_t firstNZPosInCG = (uint32_t)n;
+
+    return ((lastNZPosInCG << 16) | firstNZPosInCG);
+}
+
 } // closing - anonymous file-static namespace
 
 namespace x265 {
@@ -817,6 +848,7 @@
     p.cu[BLOCK_16x16].copy_cnt = copy_count<16>;
     p.cu[BLOCK_32x32].copy_cnt = copy_count<32>;
 
-    p.findPosLast = findPosLast_c;
+    p.scanPosLast = scanPosLast_c;
+    p.findPosFirstLast = findPosFirstLast_c;
 }
 }
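findPosFirstLast_c() packs the first and last non-zero scan positions of a 4x4 coding group into one 32-bit value; a caller would unpack it like this (a sketch, not code from the patch):

    /* coeffs points at the top-left of the coding group inside a trSize x trSize block */
    uint32_t packed  = findPosFirstLast_c(coeffs, trSize, scanTbl);
    uint32_t firstNZ = packed & 0xFFFF; /* scan index of the first non-zero coeff */
    uint32_t lastNZ  = packed >> 16;    /* scan index of the last non-zero coeff */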
x265_1.6.tar.gz/source/common/frame.cpp -> x265_1.7.tar.gz/source/common/frame.cpp
@@ -31,18 +31,21 @@
 Frame::Frame()
 {
     m_bChromaExtended = false;
+    m_lowresInit = false;
     m_reconRowCount.set(0);
     m_countRefEncoders = 0;
     m_encData = NULL;
     m_reconPic = NULL;
     m_next = NULL;
     m_prev = NULL;
+    m_param = NULL;
     memset(&m_lowres, 0, sizeof(m_lowres));
 }
 
 bool Frame::create(x265_param *param)
 {
     m_fencPic = new PicYuv;
+    m_param = param;
 
     return m_fencPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp) &&
            m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode);
x265_1.6.tar.gz/source/common/frame.h -> x265_1.7.tar.gz/source/common/frame.h
@@ -56,6 +56,7 @@
     void* m_userData; // user provided pointer passed in with this picture
 
     Lowres m_lowres;
+    bool m_lowresInit; // lowres init complete (pre-analysis)
 
     bool m_bChromaExtended; // orig chroma planes motion extended for weight analysis
 
 /* Frame Parallelism - notification between FrameEncoders of available motion reference rows */
@@ -64,7 +65,7 @@
 
     Frame* m_next; // PicList doubly linked list pointers
     Frame* m_prev;
-
+    x265_param* m_param; // Points to the latest param set for the frame.
     x265_analysis_data m_analysisData;
 
     Frame();
x265_1.6.tar.gz/source/common/framedata.h -> x265_1.7.tar.gz/source/common/framedata.h
@@ -74,6 +74,7 @@
     uint32_t numEncodedCUs;   /* ctuAddr of last encoded CTU in row */
     uint32_t encodedBits;     /* sum of 'totalBits' of encoded CTUs */
     uint32_t satdForVbv;      /* sum of lowres (estimated) costs for entire row */
+    uint32_t intraSatdForVbv; /* sum of lowres (estimated) intra costs for entire row */
     uint32_t diagSatd;
     uint32_t diagIntraSatd;
     double diagQp;
x265_1.6.tar.gz/source/common/ipfilter.cpp -> x265_1.7.tar.gz/source/common/ipfilter.cpp
@@ -34,27 +34,8 @@
 #endif
 
 namespace {
-template<int dstStride, int width, int height>
-void pixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst)
-{
-    int shift = IF_INTERNAL_PREC - X265_DEPTH;
-    int row, col;
-
-    for (row = 0; row < height; row++)
-    {
-        for (col = 0; col < width; col++)
-        {
-            int16_t val = src[col] << shift;
-            dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
-        }
-
-        src += srcStride;
-        dst += dstStride;
-    }
-}
-
-template<int dstStride>
-void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
+template<int width, int height>
+void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
 {
     int shift = IF_INTERNAL_PREC - X265_DEPTH;
     int row, col;
@@ -398,7 +379,7 @@
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>;
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define CHROMA_422(W, H) \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -407,7 +388,7 @@
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>;
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define CHROMA_444(W, H) \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -416,7 +397,7 @@
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>;
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define LUMA(W, H) \
     p.pu[LUMA_ ## W ## x ## H].luma_hpp = interp_horiz_pp_c<8, W, H>; \
@@ -426,7 +407,7 @@
     p.pu[LUMA_ ## W ## x ## H].luma_vsp = interp_vert_sp_c<8, W, H>; \
     p.pu[LUMA_ ## W ## x ## H].luma_vss = interp_vert_ss_c<8, W, H>; \
     p.pu[LUMA_ ## W ## x ## H].luma_hvpp = interp_hv_pp_c<8, W, H>; \
-    p.pu[LUMA_ ## W ## x ## H].filter_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>
+    p.pu[LUMA_ ## W ## x ## H].convert_p2s = filterPixelToShort_c<W, H>;
 
 void setupFilterPrimitives_c(EncoderPrimitives& p)
 {
@@ -482,6 +463,7 @@
     CHROMA_422(4, 8);
     CHROMA_422(4, 4);
+    CHROMA_422(2, 4);
     CHROMA_422(2, 8);
     CHROMA_422(8, 16);
     CHROMA_422(8, 8);
@@ -530,11 +512,6 @@
     CHROMA_444(48, 64);
     CHROMA_444(64, 16);
     CHROMA_444(16, 64);
-    p.luma_p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-
-    p.chroma[X265_CSP_I444].p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-    p.chroma[X265_CSP_I420].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
-    p.chroma[X265_CSP_I422].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
 
     p.extendRowBorder = extendCURowColBorder;
 }
x265_1.6.tar.gz/source/common/loopfilter.cpp -> x265_1.7.tar.gz/source/common/loopfilter.cpp
@@ -42,18 +42,23 @@
         dst[x] = signOf(src1[x] - src2[x]);
 }
 
-void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t signLeft)
+void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t* signLeft, intptr_t stride)
 {
-    int x;
-    int8_t signRight;
+    int x, y;
+    int8_t signRight, signLeft0;
     int8_t edgeType;
 
-    for (x = 0; x < width; x++)
+    for (y = 0; y < 2; y++)
     {
-        signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
-        edgeType = signRight + signLeft + 2;
-        signLeft = -signRight;
-        rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        signLeft0 = signLeft[y];
+        for (x = 0; x < width; x++)
+        {
+            signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
+            edgeType = signRight + signLeft0 + 2;
+            signLeft0 = -signRight;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
     }
 }
 
@@ -72,6 +77,25 @@
     }
 }
 
+void processSaoCUE1_2Rows(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width)
+{
+    int x, y;
+    int8_t signDown;
+    int edgeType;
+
+    for (y = 0; y < 2; y++)
+    {
+        for (x = 0; x < width; x++)
+        {
+            signDown = signOf(rec[x] - rec[x + stride]);
+            edgeType = signDown + upBuff1[x] + 2;
+            upBuff1[x] = -signDown;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
+    }
+}
+
 void processSaoCUE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int width, intptr_t stride)
 {
     int x;
@@ -119,8 +143,11 @@
 {
     p.saoCuOrgE0 = processSaoCUE0;
     p.saoCuOrgE1 = processSaoCUE1;
-    p.saoCuOrgE2 = processSaoCUE2;
-    p.saoCuOrgE3 = processSaoCUE3;
+    p.saoCuOrgE1_2Rows = processSaoCUE1_2Rows;
+    p.saoCuOrgE2[0] = processSaoCUE2;
+    p.saoCuOrgE2[1] = processSaoCUE2;
+    p.saoCuOrgE3[0] = processSaoCUE3;
+    p.saoCuOrgE3[1] = processSaoCUE3;
     p.saoCuOrgB0 = processSaoCUB0;
 
     p.sign = calSign;
 }
x265_1.6.tar.gz/source/common/param.cpp -> x265_1.7.tar.gz/source/common/param.cpp
@@ -87,7 +87,7 @@
 extern "C"
 void x265_param_free(x265_param* p)
 {
-    return x265_free(p);
+    x265_free(p);
 }
 
 extern "C"
@@ -117,6 +117,7 @@
     param->levelIdc = 0;
     param->bHighTier = 0;
     param->interlaceMode = 0;
+    param->bAnnexB = 1;
     param->bRepeatHeaders = 0;
     param->bEnableAccessUnitDelimiters = 0;
     param->bEmitHRDSEI = 0;
@@ -209,6 +210,7 @@
     param->rc.zones = NULL;
     param->rc.bEnableSlowFirstPass = 0;
     param->rc.bStrictCbr = 0;
+    param->rc.qgSize = 64; /* Same as maxCUSize */
 
     /* Video Usability Information (VUI) */
     param->vui.aspectRatioIdc = 0;
@@ -263,6 +265,7 @@
         param->rc.aqStrength = 0.0;
         param->rc.aqMode = X265_AQ_NONE;
         param->rc.cuTree = 0;
+        param->rc.qgSize = 32;
         param->bEnableFastIntra = 1;
     }
     else if (!strcmp(preset, "superfast"))
@@ -279,6 +282,7 @@
         param->rc.aqStrength = 0.0;
         param->rc.aqMode = X265_AQ_NONE;
         param->rc.cuTree = 0;
+        param->rc.qgSize = 32;
         param->bEnableSAO = 0;
         param->bEnableFastIntra = 1;
     }
@@ -292,6 +296,7 @@
         param->rdLevel = 2;
         param->maxNumReferences = 1;
         param->rc.cuTree = 0;
+        param->rc.qgSize = 32;
         param->bEnableFastIntra = 1;
     }
     else if (!strcmp(preset, "faster"))
@@ -565,6 +570,7 @@
         p->levelIdc = atoi(value);
     }
     OPT("high-tier") p->bHighTier = atobool(value);
+    OPT("allow-non-conformance") p->bAllowNonConformance = atobool(value);
     OPT2("log-level", "log")
     {
         p->logLevel = atoi(value);
@@ -575,6 +581,7 @@
         }
     }
     OPT("cu-stats") p->bLogCuStats = atobool(value);
+    OPT("annexb") p->bAnnexB = atobool(value);
     OPT("repeat-headers") p->bRepeatHeaders = atobool(value);
     OPT("wpp") p->bEnableWavefront = atobool(value);
     OPT("ctu") p->maxCUSize = (uint32_t)atoi(value);
@@ -843,6 +850,9 @@
     OPT2("pools", "numa-pools") p->numaPools = strdup(value);
     OPT("lambda-file") p->rc.lambdaFileName = strdup(value);
     OPT("analysis-file") p->analysisFileName = strdup(value);
+    OPT("qg-size") p->rc.qgSize = atoi(value);
+    OPT("master-display") p->masteringDisplayColorVolume = strdup(value);
+    OPT("max-cll") p->contentLightLevelInfo = strdup(value);
     else
         return X265_PARAM_BAD_NAME;
 #undef OPT
@@ -1183,7 +1193,7 @@
     uint32_t maxLog2CUSize = (uint32_t)g_log2Size[param->maxCUSize];
     uint32_t minLog2CUSize = (uint32_t)g_log2Size[param->minCUSize];
 
-    if (g_ctuSizeConfigured || ATOMIC_INC(&g_ctuSizeConfigured) > 1)
+    if (ATOMIC_INC(&g_ctuSizeConfigured) > 1)
    {
         if (g_maxCUSize != param->maxCUSize)
         {
@@ -1264,22 +1274,20 @@
     x265_log(param, X265_LOG_INFO, "b-pyramid / weightp / weightb / refs: %d / %d / %d / %d\n",
              param->bBPyramid, param->bEnableWeightedPred, param->bEnableWeightedBiPred, param->maxNumReferences);
 
+    if (param->rc.aqMode)
+        x265_log(param, X265_LOG_INFO, "AQ: mode / str / qg-size / cu-tree : %d / %0.1f / %d / %d\n", param->rc.aqMode,
+                 param->rc.aqStrength, param->rc.qgSize, param->rc.cuTree);
+
     if (param->bLossless)
         x265_log(param, X265_LOG_INFO, "Rate Control : Lossless\n");
     else switch (param->rc.rateControlMode)
     {
     case X265_RC_ABR:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : ABR-%d kbps / %0.1f / %d\n", param->rc.bitrate,
-                 param->rc.aqStrength, param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control / qCompress : ABR-%d kbps / %0.2f\n", param->rc.bitrate, param->rc.qCompress); break;
     case X265_RC_CQP:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CQP-%d / %0.1f / %d\n", param->rc.qp, param->rc.aqStrength,
-                 param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control : CQP-%d\n", param->rc.qp); break;
    case X265_RC_CRF:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CRF-%0.1f / %0.1f / %d\n", param->rc.rfConstant,
-                 param->rc.aqStrength, param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control / qCompress : CRF-%0.1f / %0.2f\n", param->rc.rfConstant, param->rc.qCompress); break;
     }
 
     if (param->rc.vbvBufferSize)
@@ -1327,6 +1335,43 @@
     fflush(stderr);
 }
 
+void x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam)
+{
+    if (!param || !reconfiguredParam)
+        return;
+
+    x265_log(param,X265_LOG_INFO, "Reconfigured param options :\n");
+
+    char buf[80] = { 0 };
+    char tmp[40];
+#define TOOLCMP(COND1, COND2, STR, VAL) if (COND1 != COND2) { sprintf(tmp, STR, VAL); appendtool(param, buf, sizeof(buf), tmp); }
+    TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences, "ref=%d", reconfiguredParam->maxNumReferences);
+    TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize, "max-tu-size=%d", reconfiguredParam->maxTUSize);
+    TOOLCMP(param->searchRange, reconfiguredParam->searchRange, "merange=%d", reconfiguredParam->searchRange);
+    TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "subme= %d", reconfiguredParam->subpelRefine);
+    TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "rd=%d", reconfiguredParam->rdLevel);
+    TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "psy-rd=%.2lf", reconfiguredParam->psyRd);
+    TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "rdoq=%d", reconfiguredParam->rdoqLevel);
+    TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "psy-rdoq=%.2lf", reconfiguredParam->psyRdoq);
+    TOOLCMP(param->noiseReductionIntra, reconfiguredParam->noiseReductionIntra, "nr-intra=%d", reconfiguredParam->noiseReductionIntra);
+    TOOLCMP(param->noiseReductionInter, reconfiguredParam->noiseReductionInter, "nr-inter=%d", reconfiguredParam->noiseReductionInter);
+    TOOLCMP(param->bEnableTSkipFast, reconfiguredParam->bEnableTSkipFast, "tskip-fast=%d", reconfiguredParam->bEnableTSkipFast);
+    TOOLCMP(param->bEnableSignHiding, reconfiguredParam->bEnableSignHiding, "signhide=%d", reconfiguredParam->bEnableSignHiding);
+    TOOLCMP(param->bEnableFastIntra, reconfiguredParam->bEnableFastIntra, "fast-intra=%d", reconfiguredParam->bEnableFastIntra);
+    if (param->bEnableLoopFilter && (param->deblockingFilterBetaOffset != reconfiguredParam->deblockingFilterBetaOffset
+        || param->deblockingFilterTCOffset != reconfiguredParam->deblockingFilterTCOffset))
+    {
+        sprintf(tmp, "deblock(tC=%d:B=%d)", param->deblockingFilterTCOffset, param->deblockingFilterBetaOffset);
+        appendtool(param, buf, sizeof(buf), tmp);
+    }
+    else
+        TOOLCMP(param->bEnableLoopFilter, reconfiguredParam->bEnableLoopFilter, "deblock=%d", reconfiguredParam->bEnableLoopFilter);
+
+    TOOLCMP(param->bEnableTemporalMvp, reconfiguredParam->bEnableTemporalMvp, "tmvp=%d", reconfiguredParam->bEnableTemporalMvp);
+    TOOLCMP(param->bEnableEarlySkip, reconfiguredParam->bEnableEarlySkip, "early-skip=%d", reconfiguredParam->bEnableEarlySkip);
+    x265_log(param, X265_LOG_INFO, "tools:%s\n", buf);
+}
+
 char *x265_param2string(x265_param* p)
 {
     char *buf, *s;
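The new OPT entries make the HDR strings and the quantization-group size reachable through the existing public x265_param_parse() entry point; a minimal sketch (sample values borrowed from the cli.rst examples):

    x265_param *p = x265_param_alloc();
    x265_param_default(p);
    x265_param_parse(p, "qg-size", "32");
    x265_param_parse(p, "master-display",
                     "G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)");
    x265_param_parse(p, "max-cll", "1000,400"); /* max content light, max frame-average light */
    x265_param_free(p);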
x265_1.6.tar.gz/source/common/param.h -> x265_1.7.tar.gz/source/common/param.h
@@ -28,6 +28,7 @@
 int x265_check_params(x265_param *param);
 int x265_set_globals(x265_param *param);
 void x265_print_params(x265_param *param);
+void x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam);
 void x265_param_apply_fastfirstpass(x265_param *p);
 char* x265_param2string(x265_param *param);
 int x265_atoi(const char *str, bool& bError);
x265_1.6.tar.gz/source/common/picyuv.cpp -> x265_1.7.tar.gz/source/common/picyuv.cpp
@@ -175,8 +175,7 @@
 
         for (int r = 0; r < height; r++)
         {
-            for (int c = 0; c < width; c++)
-                yPixel[c] = (pixel)yChar[c];
+            memcpy(yPixel, yChar, width * sizeof(pixel));
 
             yPixel += m_stride;
             yChar += pic.stride[0] / sizeof(*yChar);
@@ -184,11 +183,8 @@
 
         for (int r = 0; r < height >> m_vChromaShift; r++)
         {
-            for (int c = 0; c < width >> m_hChromaShift; c++)
-            {
-                uPixel[c] = (pixel)uChar[c];
-                vPixel[c] = (pixel)vChar[c];
-            }
+            memcpy(uPixel, uChar, (width >> m_hChromaShift) * sizeof(pixel));
+            memcpy(vPixel, vChar, (width >> m_hChromaShift) * sizeof(pixel));
 
             uPixel += m_strideC;
             vPixel += m_strideC;
x265_1.6.tar.gz/source/common/pixel.cpp -> x265_1.7.tar.gz/source/common/pixel.cpp
@@ -582,7 +582,7 @@
     }
 }
 
-void scale1D_128to64(pixel *dst, const pixel *src, intptr_t /*stride*/)
+void scale1D_128to64(pixel *dst, const pixel *src)
 {
     int x;
     const pixel* src1 = src;
x265_1.6.tar.gz/source/common/predict.cpp -> x265_1.7.tar.gz/source/common/predict.cpp
@@ -273,7 +273,7 @@
 void Predict::predInterLumaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const
 {
     int16_t* dst = dstSYuv.getLumaAddr(pu.puAbsPartIdx);
-    int dstStride = dstSYuv.m_size;
+    intptr_t dstStride = dstSYuv.m_size;
 
     intptr_t srcStride = refPic.m_stride;
     intptr_t srcOffset = (mv.x >> 2) + (mv.y >> 2) * srcStride;
@@ -288,7 +288,7 @@
     X265_CHECK(dstStride == MAX_CU_SIZE, "stride expected to be max cu size\n");
 
     if (!(yFrac | xFrac))
-        primitives.luma_p2s(src, srcStride, dst, pu.width, pu.height);
+        primitives.pu[partEnum].convert_p2s(src, srcStride, dst, dstStride);
     else if (!yFrac)
         primitives.pu[partEnum].luma_hps(src, srcStride, dst, dstStride, xFrac, 0);
     else if (!xFrac)
@@ -375,14 +375,13 @@
 
     int partEnum = partitionFromSizes(pu.width, pu.height);
     uint32_t cxWidth = pu.width >> m_hChromaShift;
-    uint32_t cxHeight = pu.height >> m_vChromaShift;
 
-    X265_CHECK(((cxWidth | cxHeight) % 2) == 0, "chroma block size expected to be multiple of 2\n");
+    X265_CHECK(((cxWidth | (pu.height >> m_vChromaShift)) % 2) == 0, "chroma block size expected to be multiple of 2\n");
 
     if (!(yFrac | xFrac))
     {
-        primitives.chroma[m_csp].p2s(refCb, refStride, dstCb, cxWidth, cxHeight);
-        primitives.chroma[m_csp].p2s(refCr, refStride, dstCr, cxWidth, cxHeight);
+        primitives.chroma[m_csp].pu[partEnum].p2s(refCb, refStride, dstCb, dstStride);
+        primitives.chroma[m_csp].pu[partEnum].p2s(refCr, refStride, dstCr, dstStride);
     }
     else if (!yFrac)
     {
@@ -817,7 +816,9 @@
             const pixel refSample = *pAdiLineNext;
             // Pad unavailable samples with new value
             int nextOrTop = X265_MIN(next, leftUnits);
+
             // fill left column
+#if HIGH_BIT_DEPTH
             while (curr < nextOrTop)
             {
                 for (int i = 0; i < unitHeight; i++)
@@ -836,6 +837,24 @@
                 adi += unitWidth;
                 curr++;
             }
+#else
+            X265_CHECK(curr <= nextOrTop, "curr must be less than or equal to nextOrTop\n");
+            if (curr < nextOrTop)
+            {
+                const int fillSize = unitHeight * (nextOrTop - curr);
+                memset(adi, refSample, fillSize * sizeof(pixel));
+                curr = nextOrTop;
+                adi += fillSize;
+            }
+
+            if (curr < next)
+            {
+                const int fillSize = unitWidth * (next - curr);
+                memset(adi, refSample, fillSize * sizeof(pixel));
+                curr = next;
+                adi += fillSize;
+            }
+#endif
         }
 
         // pad all other reference samples.
x265_1.6.tar.gz/source/common/primitives.cpp -> x265_1.7.tar.gz/source/common/primitives.cpp
@@ -90,7 +90,6 @@
 
     /* alias chroma 4:4:4 from luma primitives (all but chroma filters) */
 
-    p.chroma[X265_CSP_I444].p2s = p.luma_p2s;
     p.chroma[X265_CSP_I444].cu[BLOCK_4x4].sa8d = NULL;
 
     for (int i = 0; i < NUM_PU_SIZES; i++)
@@ -98,7 +97,7 @@
         p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp;
         p.chroma[X265_CSP_I444].pu[i].addAvg = p.pu[i].addAvg;
         p.chroma[X265_CSP_I444].pu[i].satd = p.pu[i].satd;
-        p.chroma[X265_CSP_I444].pu[i].chroma_p2s = p.pu[i].filter_p2s;
+        p.chroma[X265_CSP_I444].pu[i].p2s = p.pu[i].convert_p2s;
     }
 
     for (int i = 0; i < NUM_CU_SIZES; i++)
x265_1.6.tar.gz/source/common/primitives.h -> x265_1.7.tar.gz/source/common/primitives.h
@@ -140,7 +140,8 @@
 typedef int(*count_nonzero_t)(const int16_t* quantCoeff);
 typedef void (*weightp_pp_t)(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
 typedef void (*weightp_sp_t)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
-typedef void (*scale_t)(pixel* dst, const pixel* src, intptr_t stride);
+typedef void (*scale1D_t)(pixel* dst, const pixel* src);
+typedef void (*scale2D_t)(pixel* dst, const pixel* src, intptr_t stride);
 typedef void (*downscale_t)(const pixel* src0, pixel* dstf, pixel* dsth, pixel* dstv, pixel* dstc, intptr_t src_stride, intptr_t dst_stride, int width, int height);
 typedef void (*extendCURowBorder_t)(pixel* txt, intptr_t stride, int width, int height, int marginX);
 
@@ -155,8 +156,7 @@
 typedef void (*filter_sp_t) (const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_ss_t) (const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_hv_pp_t) (const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-typedef void (*filter_p2s_wxh_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst);
+typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
 
 typedef void (*copy_pp_t)(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); // dst is aligned
 typedef void (*copy_sp_t)(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
@@ -168,7 +168,7 @@
 typedef void (*pixelavg_pp_t)(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int weight);
 typedef void (*addAvg_t)(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride);
 
-typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t signLeft);
+typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t* signLeft, intptr_t stride);
 typedef void (*saoCuOrgE1_t)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
 typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);
 
@@ -179,7 +179,8 @@
 typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
 
-typedef int (*findPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+typedef int (*scanPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
+typedef uint32_t (*findPosFirstLast_t)(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
 
 /* Function pointers to optimized encoder primitives. Each pointer can reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
@@ -210,7 +211,7 @@
         addAvg_t addAvg; // bidir motion compensation, uses 16bit values
 
         copy_pp_t copy_pp;
-        filter_p2s_t filter_p2s;
+        filter_p2s_t convert_p2s;
     }
     pu[NUM_PU_SIZES];
 
@@ -266,17 +267,26 @@
     dequant_scaling_t dequant_scaling;
     dequant_normal_t dequant_normal;
     denoiseDct_t denoiseDct;
-    scale_t scale1D_128to64;
-    scale_t scale2D_64to32;
+    scale1D_t scale1D_128to64;
+    scale2D_t scale2D_64to32;
 
     ssim_4x4x2_core_t ssim_4x4x2_core;
     ssim_end4_t ssim_end_4;
 
     sign_t sign;
     saoCuOrgE0_t saoCuOrgE0;
-    saoCuOrgE1_t saoCuOrgE1;
-    saoCuOrgE2_t saoCuOrgE2;
-    saoCuOrgE3_t saoCuOrgE3;
+
+    /* To avoid the overhead in avx2 optimization in handling width=16, SAO_E0_1 is split
+     * into two parts: saoCuOrgE1, saoCuOrgE1_2Rows */
+    saoCuOrgE1_t saoCuOrgE1, saoCuOrgE1_2Rows;
+
+    // saoCuOrgE2[0] is used for width<=16 and saoCuOrgE2[1] is used for width > 16.
+    saoCuOrgE2_t saoCuOrgE2[2];
+
+    /* In avx2 optimization, two rows cannot be handled simultaneously since it requires
+     * a pixel from the previous row. So, saoCuOrgE3[0] is used for width<=16 and
+     * saoCuOrgE3[1] is used for width > 16. */
+    saoCuOrgE3_t saoCuOrgE3[2];
     saoCuOrgB0_t saoCuOrgB0;
 
     downscale_t frameInitLowres;
 
@@ -289,9 +299,9 @@
     weightp_sp_t weight_sp;
     weightp_pp_t weight_pp;
 
-    filter_p2s_wxh_t luma_p2s;
-
-    findPosLast_t findPosLast;
+    scanPosLast_t scanPosLast;
+    findPosFirstLast_t findPosFirstLast;
 
     /* There is one set of chroma primitives per color space. An encoder will
     * have just a single color space and thus it will only ever use one entry
 
@@ -316,7 +326,7 @@
             filter_hps_t filter_hps;
             addAvg_t addAvg;
             copy_pp_t copy_pp;
-            filter_p2s_t chroma_p2s;
+            filter_p2s_t p2s;
         }
         pu[NUM_PU_SIZES];
 
@@ -336,7 +346,6 @@
         }
         cu[NUM_CU_SIZES];
 
-        filter_p2s_wxh_t p2s; // takes width/height as arguments
     }
     chroma[X265_CSP_COUNT];
 };
x265_1.6.tar.gz/source/common/quant.cpp -> x265_1.7.tar.gz/source/common/quant.cpp
@@ -198,7 +198,8 @@ { m_entropyCoder = &entropy; m_rdoqLevel = rdoqLevel; - m_psyRdoqScale = (int64_t)(psyScale * 256.0); + m_psyRdoqScale = (int32_t)(psyScale * 256.0); + X265_CHECK((psyScale * 256.0) < (double)MAX_INT, "psyScale value too large\n"); m_scalingList = &scalingList; m_resiDctCoeff = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE * 2); m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE); @@ -225,16 +226,15 @@ X265_FREE(m_fencShortBuf); } -void Quant::setQPforQuant(const CUData& cu) +void Quant::setQPforQuant(const CUData& ctu, int qp) { - m_tqBypass = !!cu.m_tqBypass[0]; + m_tqBypass = !!ctu.m_tqBypass[0]; if (m_tqBypass) return; - m_nr = m_frameNr ? &m_frameNr[cu.m_encData->m_frameEncoderID] : NULL; - int qpy = cu.m_qp[0]; - m_qpParam[TEXT_LUMA].setQpParam(qpy + QP_BD_OFFSET); - setChromaQP(qpy + cu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, cu.m_chromaFormat); - setChromaQP(qpy + cu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, cu.m_chromaFormat); + m_nr = m_frameNr ? &m_frameNr[ctu.m_encData->m_frameEncoderID] : NULL; + m_qpParam[TEXT_LUMA].setQpParam(qp + QP_BD_OFFSET); + setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, ctu.m_chromaFormat); + setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, ctu.m_chromaFormat); } void Quant::setChromaQP(int qpin, TextType ttype, int chFmt) @@ -515,6 +515,7 @@ { int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype; + const uint32_t usePsyMask = usePsy ? -1 : 0; X265_CHECK(scalingListType < 6, "scaling list type out of range\n"); @@ -529,9 +530,10 @@ X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n"); if (!numSig) return 0; + uint32_t trSize = 1 << log2TrSize; int64_t lambda2 = m_qpParam[ttype].lambda2; - int64_t psyScale = (m_psyRdoqScale * m_qpParam[ttype].lambda); + const int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda); /* unquant constants for measuring distortion. Scaling list quant coefficients have a (1 << 4) * scale applied that must be removed during unquant. 
Note that in real dequant there is clipping @@ -544,7 +546,7 @@ #define UNQUANT(lvl) (((lvl) * (unquantScale[blkPos] << per) + unquantRound) >> unquantShift) #define SIGCOST(bits) ((lambda2 * (bits)) >> 8) #define RDCOST(d, bits) ((((int64_t)d * d) << scaleBits) + SIGCOST(bits)) -#define PSYVALUE(rec) ((psyScale * (rec)) >> (16 - scaleBits)) +#define PSYVALUE(rec) ((psyScale * (rec)) >> (2 * transformShift + 1)) int64_t costCoeff[32 * 32]; /* d*d + lambda * bits */ int64_t costUncoded[32 * 32]; /* d*d + lambda * 0 */ @@ -557,14 +559,6 @@ int64_t costCoeffGroupSig[MLS_GRP_NUM]; /* lambda * bits of group coding cost */ uint64_t sigCoeffGroupFlag64 = 0; - uint32_t ctxSet = 0; - int c1 = 1; - int c2 = 0; - uint32_t goRiceParam = 0; - uint32_t c1Idx = 0; - uint32_t c2Idx = 0; - int cgLastScanPos = -1; - int lastScanPos = -1; const uint32_t cgSize = (1 << MLS_CG_SIZE); /* 4x4 num coef = 16 */ bool bIsLuma = ttype == TEXT_LUMA; @@ -579,30 +573,231 @@ TUEntropyCodingParameters codeParams; cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, bIsLuma); const uint32_t cgNum = 1 << (codeParams.log2TrSizeCG * 2); + const uint32_t cgStride = (trSize >> MLS_CG_LOG2_SIZE); + + uint8_t coeffNum[MLS_GRP_NUM]; // value range[0, 16] + uint16_t coeffSign[MLS_GRP_NUM]; // bit mask map for non-zero coeff sign + uint16_t coeffFlag[MLS_GRP_NUM]; // bit mask map for non-zero coeff + +#if CHECKED_BUILD || _DEBUG + // clean output buffer, the asm version of scanPosLast Never output anything after latest non-zero coeff group + memset(coeffNum, 0, sizeof(coeffNum)); + memset(coeffSign, 0, sizeof(coeffNum)); + memset(coeffFlag, 0, sizeof(coeffNum)); +#endif + const int lastScanPos = primitives.scanPosLast(codeParams.scan, dstCoeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codeParams.scanType], trSize); + const int cgLastScanPos = (lastScanPos >> LOG2_SCAN_SET_SIZE); + /* TODO: update bit estimates if dirty */ EstBitsSbac& estBitsSbac = m_entropyCoder->m_estBitsSbac; - uint32_t scanPos; - coeffGroupRDStats cgRdStats; + uint32_t scanPos = 0; + uint32_t c1 = 1; + + // process trail all zero Coeff Group + + /* coefficients after lastNZ have no distortion signal cost */ + const int zeroCG = cgNum - 1 - cgLastScanPos; + memset(&costCoeff[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t)); + memset(&costSig[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t)); + + /* sum zero coeff (uncodec) cost */ + + // TODO: does we need these cost? 
+ if (usePsyMask) + { + for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++) + { + X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n"); + + uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE); + uint32_t blkPos = codeParams.scan[scanPosBase]; + + // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA + for (int y = 0; y < MLS_CG_SIZE; y++) + { + for (int x = 0; x < MLS_CG_SIZE; x++) + { + int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ + int predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/ + + costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits; + + /* when no residual coefficient is coded, predicted coef == recon coef */ + costUncoded[blkPos + x] -= PSYVALUE(predictedCoef); + + totalUncodedCost += costUncoded[blkPos + x]; + totalRdCost += costUncoded[blkPos + x]; + } + blkPos += trSize; + } + } + } + else + { + // non-psy path + for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++) + { + X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n"); + + uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE); + uint32_t blkPos = codeParams.scan[scanPosBase]; + + for (int y = 0; y < MLS_CG_SIZE; y++) + { + for (int x = 0; x < MLS_CG_SIZE; x++) + { + int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ + costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits; + + totalUncodedCost += costUncoded[blkPos + x]; + totalRdCost += costUncoded[blkPos + x]; + } + blkPos += trSize; + } + } + } + + static const uint8_t table_cnt[5][SCAN_SET_SIZE] = + { + // patternSigCtx = 0 + { + 2, 1, 1, 0, + 1, 1, 0, 0, + 1, 0, 0, 0, + 0, 0, 0, 0, + }, + // patternSigCtx = 1 + { + 2, 2, 2, 2, + 1, 1, 1, 1, + 0, 0, 0, 0, + 0, 0, 0, 0, + }, + // patternSigCtx = 2 + { + 2, 1, 0, 0, + 2, 1, 0, 0, + 2, 1, 0, 0, + 2, 1, 0, 0, + }, + // patternSigCtx = 3 + { + 2, 2, 2, 2, + 2, 2, 2, 2, + 2, 2, 2, 2, + 2, 2, 2, 2, + }, + // 4x4 + { + 0, 1, 4, 5, + 2, 3, 4, 5, + 6, 6, 8, 8, + 7, 7, 8, 8 + } + }; /* iterate over coding groups in reverse scan order */ - for (int cgScanPos = cgNum - 1; cgScanPos >= 0; cgScanPos--) + for (int cgScanPos = cgLastScanPos; cgScanPos >= 0; cgScanPos--) { + uint32_t ctxSet = (cgScanPos && bIsLuma) ? 2 : 0; const uint32_t cgBlkPos = codeParams.scanCG[cgScanPos]; const uint32_t cgPosY = cgBlkPos >> codeParams.log2TrSizeCG; const uint32_t cgPosX = cgBlkPos - (cgPosY << codeParams.log2TrSizeCG); const uint64_t cgBlkPosMask = ((uint64_t)1 << cgBlkPos); - memset(&cgRdStats, 0, sizeof(coeffGroupRDStats)); + const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride); + const int ctxSigOffset = codeParams.firstSignificanceMapContext + (cgScanPos && bIsLuma ? 3 : 0); + + if (c1 == 0) + ctxSet++; + c1 = 1; - const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG); + if (cgScanPos && (coeffNum[cgScanPos] == 0)) + { + // TODO: does we need zero-coeff cost? 
+ const uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE); + uint32_t blkPos = codeParams.scan[scanPosBase]; + if (usePsyMask) + { + // TODO: we can't SIMD optimize this because PSYVALUE needs a 64-bit multiply; converting to double could be faster via FMA + for (int y = 0; y < MLS_CG_SIZE; y++) + { + for (int x = 0; x < MLS_CG_SIZE; x++) + { + int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ + int predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/ + + costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits; + + /* when no residual coefficient is coded, predicted coef == recon coef */ + costUncoded[blkPos + x] -= PSYVALUE(predictedCoef); + + totalUncodedCost += costUncoded[blkPos + x]; + totalRdCost += costUncoded[blkPos + x]; + + const uint32_t scanPosOffset = y * MLS_CG_SIZE + x; + const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset; + X265_CHECK(trSize > 4, "trSize check failure\n"); + X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, codeParams.scan[scanPosBase + scanPosOffset], bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n"); + + costSig[scanPosBase + scanPosOffset] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]); + costCoeff[scanPosBase + scanPosOffset] = costUncoded[blkPos + x]; + sigRateDelta[blkPos + x] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig]; + } + blkPos += trSize; + } + } + else + { + // non-psy path + for (int y = 0; y < MLS_CG_SIZE; y++) + { + for (int x = 0; x < MLS_CG_SIZE; x++) + { + int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ + costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits; + + totalUncodedCost += costUncoded[blkPos + x]; + totalRdCost += costUncoded[blkPos + x]; + + const uint32_t scanPosOffset = y * MLS_CG_SIZE + x; + const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset; + X265_CHECK(trSize > 4, "trSize check failure\n"); + X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, codeParams.scan[scanPosBase + scanPosOffset], bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n"); + + costSig[scanPosBase + scanPosOffset] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]); + costCoeff[scanPosBase + scanPosOffset] = costUncoded[blkPos + x]; + sigRateDelta[blkPos + x] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig]; + } + blkPos += trSize; + } + } + + /* there were no coded coefficients in this coefficient group */ + { + uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride); + costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[ctxSig][0]); + totalRdCost += costCoeffGroupSig[cgScanPos]; /* add cost of 0 bit in significant CG bitmap */ + } + continue; + } + + coeffGroupRDStats cgRdStats; + memset(&cgRdStats, 0, sizeof(coeffGroupRDStats)); + + uint32_t subFlagMask = coeffFlag[cgScanPos]; + int c2 = 0; + uint32_t goRiceParam = 0; + uint32_t c1Idx = 0; + uint32_t c2Idx = 0; /* iterate over coefficients in each group in reverse scan order */ for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--) { scanPos = (cgScanPos << MLS_CG_SIZE) + scanPosinCG; uint32_t blkPos = codeParams.scan[scanPos]; - uint16_t maxAbsLevel = (int16_t)abs(dstCoeff[blkPos]); /* abs(quantized coeff) */ + uint32_t maxAbsLevel = abs(dstCoeff[blkPos]); /* abs(quantized coeff) */ int signCoef = m_resiDctCoeff[blkPos]; /* pre-quantization DCT coeff */ int predictedCoef = m_fencDctCoeff[blkPos] - signCoef; /* predicted DCT = source DCT - residual DCT*/ @@ -611,22 +806,21 @@ * FIX15 nature of the CABAC cost tables minus the forward transform scale */ /* cost of not coding this coefficient (all distortion, no signal bits) */ - costUncoded[scanPos] = (int64_t)(signCoef * signCoef) << scaleBits; - if (usePsy && blkPos) + costUncoded[blkPos] = ((int64_t)signCoef * signCoef) << scaleBits; + X265_CHECK((!!scanPos ^ !!blkPos) == 0, "failed on (blkPos=0 && scanPos!=0)\n"); + if (usePsyMask & scanPos) /* when no residual coefficient is coded, predicted coef == recon coef */ - costUncoded[scanPos] -= PSYVALUE(predictedCoef); + costUncoded[blkPos] -= PSYVALUE(predictedCoef); - totalUncodedCost += costUncoded[scanPos]; + totalUncodedCost += costUncoded[blkPos]; - if (maxAbsLevel && lastScanPos < 0) - { - /* remember the first non-zero coef found in this reverse scan as the last pos */ - lastScanPos = scanPos; - ctxSet = (scanPos < SCAN_SET_SIZE || !bIsLuma) ? 0 : 2; - cgLastScanPos = cgScanPos; - } + // coefficient level estimation + const int* greaterOneBits = estBitsSbac.greaterOneBits[4 * ctxSet + c1]; + const uint32_t ctxSig = (blkPos == 0) ? 0 : table_cnt[(trSize == 4) ? 4 : patternSigCtx][g_scan4x4[codeParams.scanType][scanPosinCG]] + ctxSigOffset; + X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n"); - if (lastScanPos < 0) + // before the latest non-zero coeff is found + if (scanPos > (uint32_t)lastScanPos) { /* coefficients after lastNZ have no distortion signal cost */ costCoeff[scanPos] = 0; @@ -635,10 +829,24 @@ /* No non-zero coefficient yet found, but this does not mean * there is no uncoded-cost for this coefficient. Pre- * quantization the coefficient may have been non-zero */ - totalRdCost += costUncoded[scanPos]; + totalRdCost += costUncoded[blkPos]; + } + else if (!(subFlagMask & 1)) + { + // fast zero coeff path + /* set default costs to uncoded costs */ + costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]); + costCoeff[scanPos] = costUncoded[blkPos] + costSig[scanPos]; + sigRateDelta[blkPos] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig]; + totalRdCost += costCoeff[scanPos]; + rateIncUp[blkPos] = greaterOneBits[0]; + + subFlagMask >>= 1; + } else { + subFlagMask >>= 1; + const uint32_t c1c2Idx = ((c1Idx - 8) >> (sizeof(int) * CHAR_BIT - 1)) + (((-(int)c2Idx) >> (sizeof(int) * CHAR_BIT - 1)) + 1) * 2; const uint32_t baseLevel = ((uint32_t)0xD9 >> (c1c2Idx * 2)) & 3; // {1, 2, 1, 3} @@ -647,12 +855,9 @@ X265_CHECK((int)baseLevel == ((c1Idx < C1FLAG_NUMBER) ? 
(2 + (c2Idx == 0)) : 1), "scan validation 3\n"); // coefficient level estimation - const uint32_t oneCtx = 4 * ctxSet + c1; - const uint32_t absCtx = ctxSet + c2; - const int* greaterOneBits = estBitsSbac.greaterOneBits[oneCtx]; - const int* levelAbsBits = estBitsSbac.levelAbsBits[absCtx]; + const int* levelAbsBits = estBitsSbac.levelAbsBits[ctxSet + c2]; - uint16_t level = 0; + uint32_t level = 0; uint32_t sigCoefBits = 0; costCoeff[scanPos] = MAX_INT64; @@ -660,48 +865,82 @@ sigRateDelta[blkPos] = 0; else { - const uint32_t ctxSig = getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codeParams.firstSignificanceMapContext); if (maxAbsLevel < 3) { /* set default costs to uncoded costs */ - costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[ctxSig][0]); - costCoeff[scanPos] = costUncoded[scanPos] + costSig[scanPos]; + costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]); + costCoeff[scanPos] = costUncoded[blkPos] + costSig[scanPos]; } - sigRateDelta[blkPos] = estBitsSbac.significantBits[ctxSig][1] - estBitsSbac.significantBits[ctxSig][0]; - sigCoefBits = estBitsSbac.significantBits[ctxSig][1]; + sigRateDelta[blkPos] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig]; + sigCoefBits = estBitsSbac.significantBits[1][ctxSig]; } - if (maxAbsLevel) + + // NOTE: X265_MAX(maxAbsLevel - 1, 1) ==> (X>=2 -> X-1), (X<2 -> 1) | (0 < X < 2 ==> X=1) + if (maxAbsLevel == 1) { - uint16_t minAbsLevel = X265_MAX(maxAbsLevel - 1, 1); - for (uint16_t lvl = maxAbsLevel; lvl >= minAbsLevel; lvl--) + uint32_t levelBits = (c1c2Idx & 1) ? greaterOneBits[0] + IEP_RATE : ((1 + goRiceParam) << 15) + IEP_RATE; + X265_CHECK(levelBits == getICRateCost(1, 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE, "levelBits mistake\n"); + + int unquantAbsLevel = UNQUANT(1); + int d = abs(signCoef) - unquantAbsLevel; + int64_t curCost = RDCOST(d, sigCoefBits + levelBits); + + /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */ + if (usePsyMask & scanPos) { - uint32_t levelBits = getICRateCost(lvl, lvl - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE; + int reconCoef = abs(unquantAbsLevel + SIGN(predictedCoef, signCoef)); + curCost -= PSYVALUE(reconCoef); + } - int unquantAbsLevel = UNQUANT(lvl); - int d = abs(signCoef) - unquantAbsLevel; - int64_t curCost = RDCOST(d, sigCoefBits + levelBits); + if (curCost < costCoeff[scanPos]) + { + level = 1; + costCoeff[scanPos] = curCost; + costSig[scanPos] = SIGCOST(sigCoefBits); + } + } + else if (maxAbsLevel) + { + uint32_t levelBits0 = getICRateCost(maxAbsLevel, maxAbsLevel - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE; + uint32_t levelBits1 = getICRateCost(maxAbsLevel - 1, maxAbsLevel - 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE; - /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */ - if (usePsy && blkPos) - { - int reconCoef = abs(unquantAbsLevel + SIGN(predictedCoef, signCoef)); - curCost -= PSYVALUE(reconCoef); - } + int unquantAbsLevel0 = UNQUANT(maxAbsLevel); + int d0 = abs(signCoef) - unquantAbsLevel0; + int64_t curCost0 = RDCOST(d0, sigCoefBits + levelBits0); - if (curCost < costCoeff[scanPos]) - { - level = lvl; - costCoeff[scanPos] = curCost; - costSig[scanPos] = SIGCOST(sigCoefBits); - } + int unquantAbsLevel1 = UNQUANT(maxAbsLevel - 1); + int d1 = abs(signCoef) - unquantAbsLevel1; + int64_t curCost1 = RDCOST(d1, sigCoefBits + levelBits1); 
+ + /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */ + if (usePsyMask & scanPos) + { + int reconCoef; + reconCoef = abs(unquantAbsLevel0 + SIGN(predictedCoef, signCoef)); + curCost0 -= PSYVALUE(reconCoef); + + reconCoef = abs(unquantAbsLevel1 + SIGN(predictedCoef, signCoef)); + curCost1 -= PSYVALUE(reconCoef); + } + if (curCost0 < costCoeff[scanPos]) + { + level = maxAbsLevel; + costCoeff[scanPos] = curCost0; + costSig[scanPos] = SIGCOST(sigCoefBits); + } + if (curCost1 < costCoeff[scanPos]) + { + level = maxAbsLevel - 1; + costCoeff[scanPos] = curCost1; + costSig[scanPos] = SIGCOST(sigCoefBits); } } - dstCoeff[blkPos] = level; + dstCoeff[blkPos] = (int16_t)level; totalRdCost += costCoeff[scanPos]; /* record costs for sign-hiding performed at the end */ - if (level) + if ((cu.m_slice->m_pps->bSignHideEnabled ? ~0 : 0) & level) { const int32_t diff0 = level - 1 - baseLevel; const int32_t diff2 = level + 1 - baseLevel; @@ -763,41 +1002,27 @@ else if ((c1 < 3) && (c1 > 0) && level) c1++; - /* context set update */ - if (!(scanPos % SCAN_SET_SIZE) && scanPos) + if (dstCoeff[blkPos]) { - c2 = 0; - goRiceParam = 0; - - c1Idx = 0; - c2Idx = 0; - ctxSet = (scanPos == SCAN_SET_SIZE || !bIsLuma) ? 0 : 2; - X265_CHECK(c1 >= 0, "c1 is negative\n"); - ctxSet -= ((int32_t)(c1 - 1) >> 31); - c1 = 1; + sigCoeffGroupFlag64 |= cgBlkPosMask; + cgRdStats.codedLevelAndDist += costCoeff[scanPos] - costSig[scanPos]; + cgRdStats.uncodedDist += costUncoded[blkPos]; + cgRdStats.nnzBeforePos0 += scanPosinCG; } } cgRdStats.sigCost += costSig[scanPos]; - if (!scanPosinCG) - cgRdStats.sigCost0 = costSig[scanPos]; - - if (dstCoeff[blkPos]) - { - sigCoeffGroupFlag64 |= cgBlkPosMask; - cgRdStats.codedLevelAndDist += costCoeff[scanPos] - costSig[scanPos]; - cgRdStats.uncodedDist += costUncoded[scanPos]; - cgRdStats.nnzBeforePos0 += scanPosinCG; - } } /* end for (scanPosinCG) */ + X265_CHECK((cgScanPos << MLS_CG_SIZE) == (int)scanPos, "scanPos mistake\n"); + cgRdStats.sigCost0 = costSig[scanPos]; + costCoeffGroupSig[cgScanPos] = 0; - if (cgLastScanPos < 0) - { - /* nothing to do at this point */ - } - else if (!cgScanPos || cgScanPos == cgLastScanPos) + /* nothing to do in this case */ + X265_CHECK(cgLastScanPos >= 0, "cgLastScanPos check failure\n"); + + if (!cgScanPos || cgScanPos == cgLastScanPos) { /* coeff group 0 is implied to be present, no signal cost */ /* coeff group with last NZ is implied to be present, handled below */ @@ -815,7 +1040,7 @@ * of the significant coefficient group flag and evaluate whether the RD cost of the * coded group is more than the RD cost of the uncoded group */ - uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG); + uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride); int64_t costZeroCG = totalRdCost + SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]); costZeroCG += cgRdStats.uncodedDist; /* add distortion for resetting non-zero levels to zero levels */ @@ -832,23 +1057,17 @@ costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]); /* reset all coeffs to 0. UNCODE THIS COEFF GROUP! 
*/ - for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--) - { - scanPos = cgScanPos * cgSize + scanPosinCG; - uint32_t blkPos = codeParams.scan[scanPos]; - if (dstCoeff[blkPos]) - { - costCoeff[scanPos] = costUncoded[scanPos]; - costSig[scanPos] = 0; - } - dstCoeff[blkPos] = 0; - } + const uint32_t blkPos = codeParams.scan[cgScanPos * cgSize]; + memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff)); + memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff)); + memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff)); + memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff)); } } else { /* there were no coded coefficients in this coefficient group */ - uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG); + uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride); costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[ctxSig][0]); totalRdCost += costCoeffGroupSig[cgScanPos]; /* add cost of 0 bit in significant CG bitmap */ totalRdCost -= cgRdStats.sigCost; /* remove cost of significant coefficient bitmap */ @@ -909,7 +1128,7 @@ * cost of signaling it as not-significant */ uint32_t blkPos = codeParams.scan[scanPos]; if (dstCoeff[blkPos]) - { + { // Calculates the cost of signaling the last significant coefficient in the block uint32_t pos[2] = { (blkPos & (trSize - 1)), (blkPos >> log2TrSize) }; if (codeParams.scanType == SCAN_VER) @@ -940,7 +1159,7 @@ } totalRdCost -= costCoeff[scanPos]; - totalRdCost += costUncoded[scanPos]; + totalRdCost += costUncoded[blkPos]; } else totalRdCost -= costSig[scanPos]; @@ -959,34 +1178,40 @@ dstCoeff[blkPos] = (int16_t)((level ^ mask) - mask); } + // on average 49.62 coefficient positions are cleaned here /* clean uncoded coefficients */ - for (int pos = bestLastIdx; pos <= lastScanPos; pos++) + for (int pos = bestLastIdx; pos <= fastMin(lastScanPos, (bestLastIdx | (SCAN_SET_SIZE - 1))); pos++) + { dstCoeff[codeParams.scan[pos]] = 0; + } + for (int pos = (bestLastIdx & ~(SCAN_SET_SIZE - 1)) + SCAN_SET_SIZE; pos <= lastScanPos; pos += SCAN_SET_SIZE) + { + const uint32_t blkPos = codeParams.scan[pos]; + memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff)); + memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff)); + memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff)); + memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff)); + } /* rate-distortion based sign-hiding */ if (cu.m_slice->m_pps->bSignHideEnabled && numSig >= 2) { + const int realLastScanPos = (bestLastIdx - 1) >> LOG2_SCAN_SET_SIZE; int lastCG = true; - for (int subSet = cgLastScanPos; subSet >= 0; subSet--) + for (int subSet = realLastScanPos; subSet >= 0; subSet--) { int subPos = subSet << LOG2_SCAN_SET_SIZE; int n; - /* measure distance between first and last non-zero coef in this - * coding group */ - for (n = SCAN_SET_SIZE - 1; n >= 0; --n) - if (dstCoeff[codeParams.scan[n + subPos]]) - break; - if (n < 0) + if (!(sigCoeffGroupFlag64 & (1ULL << codeParams.scanCG[subSet]))) continue; - int lastNZPosInCG = n; - - for (n = 0;; n++) - if (dstCoeff[codeParams.scan[n + subPos]]) - break; + /* measure distance between first and last non-zero coef in this + * coding group */ + const uint32_t posFirstLast = primitives.findPosFirstLast(&dstCoeff[codeParams.scan[subPos]], trSize, g_scan4x4[codeParams.scanType]); + int firstNZPosInCG = (uint16_t)posFirstLast; + int lastNZPosInCG = posFirstLast >> 16; - int firstNZPosInCG = n; if 
(lastNZPosInCG - firstNZPosInCG >= SBH_THRESHOLD) { @@ -1092,22 +1317,6 @@ return numSig; } -/* Pattern decision for context derivation process of significant_coeff_flag */ -uint32_t Quant::calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG) -{ - if (!log2TrSizeCG) - return 0; - - const uint32_t trSizeCG = 1 << log2TrSizeCG; - X265_CHECK(trSizeCG <= 8, "transform CG is too large\n"); - const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1; - const uint32_t sigPos = (uint32_t)(shift >= 64 ? 0 : sigCoeffGroupFlag64 >> shift); - const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1); - const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2; - - return sigRight + sigLower; -} - /* Context derivation process of coeff_abs_significant_flag */ uint32_t Quant::getSigCtxInc(uint32_t patternSigCtx, uint32_t log2TrSize, uint32_t trSize, uint32_t blkPos, bool bIsLuma, uint32_t firstSignificanceMapContext) @@ -1175,14 +1384,3 @@ return (bIsLuma && (posX | posY) >= 4) ? 3 + offset : offset; } -/* Context derivation process of coeff_abs_significant_flag */ -uint32_t Quant::getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG) -{ - const uint32_t trSizeCG = 1 << log2TrSizeCG; - - const uint32_t sigPos = (uint32_t)(cgGroupMask >> (1 + (cgPosY << log2TrSizeCG) + cgPosX)); - const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos; - const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1)); - - return (sigRight | sigLower) & 1; -}
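The rewritten RDOQ hot loop in this quant.cpp diff evaluates at most two candidate levels per coefficient (maxAbsLevel and maxAbsLevel - 1) instead of the old descending search. A minimal scalar sketch of that decision follows; the names (lambda2, scaleBits, the unquant parameters) mirror the quant.cpp locals, but the per-candidate bit costs are placeholder inputs standing in for the CABAC estimates, so this is an illustration rather than the encoder's exact code:

    #include <cstdint>
    #include <cstdlib>

    // RDCOST(d, bits): squared transform-domain distortion plus lambda2 * bits (lambda2 is FIX8)
    static int64_t rdCost(int d, uint32_t bits, int64_t lambda2, int scaleBits)
    {
        return (((int64_t)d * d) << scaleBits) + ((lambda2 * bits) >> 8);
    }

    // Choose among level 0, maxAbsLevel and maxAbsLevel - 1, mirroring the unrolled
    // "maxAbsLevel == 1" / "else if (maxAbsLevel)" branches in the hunk above.
    static uint32_t chooseLevel(int signCoef, uint32_t maxAbsLevel,
                                const uint32_t bitCost[3], // [uncoded, maxAbsLevel, maxAbsLevel - 1]
                                int64_t lambda2, int scaleBits,
                                int unquantScale, int unquantRound, int unquantShift)
    {
        int64_t best = rdCost(abs(signCoef), bitCost[0], lambda2, scaleBits); // cost of coding 0
        uint32_t bestLvl = 0;
        uint32_t lowest = maxAbsLevel > 1 ? maxAbsLevel - 1 : 1;
        for (uint32_t lvl = maxAbsLevel; lvl && lvl >= lowest; lvl--)
        {
            int recon = (int)((lvl * unquantScale + unquantRound) >> unquantShift); // UNQUANT(lvl)
            int64_t cost = rdCost(abs(signCoef) - recon, bitCost[1 + (maxAbsLevel - lvl)], lambda2, scaleBits);
            if (cost < best)
            {
                best = cost;
                bestLvl = lvl;
            }
        }
        return bestLvl;
    }

Psy-RDOQ then subtracts PSYVALUE(recon) from each candidate's cost before the comparison, which is why the psy path in the diff keeps separate curCost0/curCost1 computations.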
View file
x265_1.6.tar.gz/source/common/quant.h -> x265_1.7.tar.gz/source/common/quant.h
Changed
@@ -41,7 +41,7 @@ int per; int qp; int64_t lambda2; /* FIX8 */ - int64_t lambda; /* FIX8 */ + int32_t lambda; /* FIX8, dynamic range is 18 bits in 8bpp and 20 bits in 16bpp */ QpParam() : qp(MAX_INT) {} @@ -53,7 +53,8 @@ per = qpScaled / 6; qp = qpScaled; lambda2 = (int64_t)(x265_lambda2_tab[qp - QP_BD_OFFSET] * 256. + 0.5); - lambda = (int64_t)(x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5); + lambda = (int32_t)(x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5); + X265_CHECK((x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5) < (double)MAX_INT, "x265_lambda_tab[] value too large\n"); } } }; @@ -82,7 +83,7 @@ QpParam m_qpParam[3]; int m_rdoqLevel; - int64_t m_psyRdoqScale; + int32_t m_psyRdoqScale; // dynamic range [0,50] * 256 = 14 bits int16_t* m_resiDctCoeff; int16_t* m_fencDctCoeff; int16_t* m_fencShortBuf; @@ -103,7 +104,7 @@ bool allocNoiseReduction(const x265_param& param); /* CU setup */ - void setQPforQuant(const CUData& cu); + void setQPforQuant(const CUData& ctu, int qp); uint32_t transformNxN(const CUData& cu, const pixel* fenc, uint32_t fencStride, const int16_t* residual, uint32_t resiStride, coeff_t* coeff, uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool useTransformSkip); @@ -111,10 +112,39 @@ void invtransformNxN(int16_t* residual, uint32_t resiStride, const coeff_t* coeff, uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig); + /* Pattern decision for context derivation process of significant_coeff_flag */ + static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG) + { + if (trSizeCG == 1) + return 0; + + X265_CHECK(trSizeCG <= 8, "transform CG is too large\n"); + X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n"); + // NOTE: cgBlkPos+1 may exceed 63, which would be an invalid shift amount, + // but in that case both cgPosX and cgPosY equal (trSizeCG - 1), + // so sigRight and sigLower mask the value to zero and the final result is still correct + const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (cgBlkPos + 1)); // only the lowest 7 bits need to be valid + + // TODO: instruction BT is faster, but _bittest64 still generates the instruction 'BT m, r' in VS2012 + const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1); + const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2; + return sigRight + sigLower; + } + + /* Context derivation process of coeff_abs_significant_flag */ + static uint32_t getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG) + { + X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n"); + // NOTE: unsafe shift operator, see NOTE in calcPatternSigCtx + const uint32_t sigPos = (uint32_t)(cgGroupMask >> (cgBlkPos + 1)); // only the lowest 8 bits need to be valid + const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos; + const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1)); + + return (sigRight | sigLower) & 1; + } + /* static methods shared with entropy.cpp */ - static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG); static uint32_t getSigCtxInc(uint32_t patternSigCtx, uint32_t log2TrSize, uint32_t trSize, uint32_t blkPos, bool bIsLuma, uint32_t firstSignificanceMapContext); - static uint32_t getSigCoeffGroupCtxInc(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG); protected:
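The two context helpers moved inline above both read neighbour flags out of a 64-bit coded-group bitmap in which bit (cgPosY * trSizeCG + cgPosX) marks a 4x4 coefficient group containing non-zero levels: the right neighbour sits one bit above cgBlkPos and the lower neighbour trSizeCG bits above. A small self-contained illustration of the lookup, using plain comparisons in place of the sign-extension masks (the flag positions are invented for the example):

    #include <cassert>
    #include <cstdint>

    int main()
    {
        const uint32_t trSizeCG = 4;              // a 16x16 TU has a 4x4 grid of coeff groups
        uint64_t cgFlags = 0;
        cgFlags |= 1ULL << (1 * trSizeCG + 2);    // CG (x=2, y=1) holds non-zero levels
        cgFlags |= 1ULL << (2 * trSizeCG + 1);    // CG (x=1, y=2) holds non-zero levels

        // Query CG (x=1, y=1): its right neighbour is (2,1), its lower neighbour (1,2)
        const uint32_t cgPosX = 1, cgPosY = 1;
        const uint32_t cgBlkPos = cgPosY * trSizeCG + cgPosX;
        const uint32_t sigPos = (uint32_t)(cgFlags >> (cgBlkPos + 1));

        // calcPatternSigCtx: right neighbour lands in bit 0, lower neighbour in bit trSizeCG - 1
        const uint32_t sigRight = (cgPosX != trSizeCG - 1) ? (sigPos & 1) : 0;
        const uint32_t sigLower = (cgPosY != trSizeCG - 1) ? ((sigPos >> (trSizeCG - 2)) & 2) : 0;
        assert(sigRight + sigLower == 3);         // patternSigCtx = 3: both neighbours coded

        // getSigCoeffGroupCtxInc: 1 if either neighbour is coded
        assert(((sigRight | (sigPos >> (trSizeCG - 1))) & 1) == 1);
        return 0;
    }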
View file
x265_1.6.tar.gz/source/common/slice.h -> x265_1.7.tar.gz/source/common/slice.h
Changed
@@ -98,6 +98,7 @@ LEVEL6 = 180, LEVEL6_1 = 183, LEVEL6_2 = 186, + LEVEL8_5 = 255, }; }
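The new enumerant keeps the existing encoding of general_level_idc as thirty times the level number, and 8.5 * 30 = 255 is also the largest value the 8-bit syntax element can carry. A one-line check of the pattern (enum values copied from the hunk above):

    enum { LEVEL6_2 = 186, LEVEL8_5 = 255 };              // values from slice.h
    static_assert(LEVEL6_2 == (int)(6.2 * 30), "idc = level * 30");
    static_assert(LEVEL8_5 == (int)(8.5 * 30), "idc = level * 30");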
View file
x265_1.6.tar.gz/source/common/threading.h -> x265_1.7.tar.gz/source/common/threading.h
Changed
@@ -189,6 +189,14 @@ LeaveCriticalSection(&m_cs); } + void poke(void) + { + /* awaken all waiting threads, but make no change */ + EnterCriticalSection(&m_cs); + WakeAllConditionVariable(&m_cv); + LeaveCriticalSection(&m_cs); + } + void incr() { EnterCriticalSection(&m_cs); @@ -370,6 +378,14 @@ pthread_mutex_unlock(&m_mutex); } + void poke(void) + { + /* awaken all waiting threads, but make no change */ + pthread_mutex_lock(&m_mutex); + pthread_cond_broadcast(&m_cond); + pthread_mutex_unlock(&m_mutex); + } + void incr() { pthread_mutex_lock(&m_mutex);
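poke() in both back-ends broadcasts the condition variable without changing any protected state, so every blocked thread wakes, re-evaluates whatever predicate it is waiting on, and goes back to sleep if that predicate is still false. A hypothetical sketch of the same pattern with std::condition_variable (not the x265 class itself):

    #include <condition_variable>
    #include <mutex>

    struct Waiter
    {
        std::mutex m;
        std::condition_variable cv;

        template <typename Pred>
        void waitUntil(Pred ready)
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, ready);            // predicate is re-checked after every wakeup
        }

        void poke()                        // awaken all waiters, but make no change
        {
            std::lock_guard<std::mutex> lk(m);
            cv.notify_all();
        }
    };

This is what makes a state-free broadcast safe: a waiter whose condition has not actually changed simply re-blocks.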
View file
x265_1.6.tar.gz/source/common/threadpool.cpp -> x265_1.7.tar.gz/source/common/threadpool.cpp
Changed
@@ -232,7 +232,7 @@ int cpuCount = getCpuCount(); bool bNumaSupport = false; -#if _WIN32_WINNT >= 0x0601 +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 bNumaSupport = true; #elif HAVE_LIBNUMA bNumaSupport = numa_available() >= 0; @@ -241,10 +241,10 @@ for (int i = 0; i < cpuCount; i++) { -#if _WIN32_WINNT >= 0x0601 +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 UCHAR node; if (GetNumaProcessorNode((UCHAR)i, &node)) - cpusPerNode[X265_MIN(node, MAX_NODE_NUM)]++; + cpusPerNode[X265_MIN(node, (UCHAR)MAX_NODE_NUM)]++; else #elif HAVE_LIBNUMA if (bNumaSupport >= 0) @@ -261,7 +261,7 @@ /* limit nodes based on param->numaPools */ if (p->numaPools && *p->numaPools) { - char *nodeStr = p->numaPools; + const char *nodeStr = p->numaPools; for (int i = 0; i < numNumaNodes; i++) { if (!*nodeStr) @@ -373,7 +373,7 @@ return true; } -void ThreadPool::stop() +void ThreadPool::stopWorkers() { if (m_workers) { @@ -408,7 +408,7 @@ /* static */ void ThreadPool::setThreadNodeAffinity(int numaNode) { -#if _WIN32_WINNT >= 0x0601 +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 GROUP_AFFINITY groupAffinity; if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity)) { @@ -433,7 +433,7 @@ /* static */ int ThreadPool::getNumaNodeCount() { -#if _WIN32_WINNT >= 0x0601 +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 ULONG num = 1; if (GetNumaHighestNodeNumber(&num)) num++;
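Alongside the Windows guard fix above, the Linux path keeps counting logical CPUs per NUMA node through libnuma. A rough standalone equivalent of that census (illustrative only; link with -lnuma, and the 64-node cap is this sketch's assumption, not the pool's):

    #include <cstdio>
    #include <unistd.h>
    #include <numa.h>

    int main()
    {
        if (numa_available() < 0)
            return 1;                              // kernel or library lacks NUMA support

        const int cpuCount = (int)sysconf(_SC_NPROCESSORS_ONLN);
        int cpusPerNode[64] = { 0 };

        for (int i = 0; i < cpuCount; i++)
        {
            int node = numa_node_of_cpu(i);        // same census ThreadPool::create performs
            if (node >= 0 && node < 64)
                cpusPerNode[node]++;
        }

        for (int n = 0; n <= numa_max_node() && n < 64; n++)
            printf("node %d: %d cpus\n", n, cpusPerNode[n]);
        return 0;
    }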
View file
x265_1.6.tar.gz/source/common/threadpool.h -> x265_1.7.tar.gz/source/common/threadpool.h
Changed
@@ -94,7 +94,7 @@ bool create(int numThreads, int maxProviders, int node); bool start(); - void stop(); + void stopWorkers(); void setCurrentThreadAffinity(); int tryAcquireSleepingThread(sleepbitmap_t firstTryBitmap, sleepbitmap_t secondTryBitmap); int tryBondPeers(int maxPeers, sleepbitmap_t peerBitmap, BondedTaskGroup& master);
View file
x265_1.6.tar.gz/source/common/x86/asm-primitives.cpp -> x265_1.7.tar.gz/source/common/x86/asm-primitives.cpp
Changed
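Every hunk below follows one pattern: EncoderPrimitives is a table of function pointers that setupAssemblyPrimitives() fills in ascending order of instruction-set capability, so each if (cpuMask & X265_CPU_*) block may overwrite assignments made by an earlier block and the widest supported implementation wins (hence notes like "MUST be done after LUMA_FILTERS() to overwrite default version"). A stripped-down sketch of the idea, with invented names in place of the real tables:

    #include <cstdint>

    typedef int (*satd_t)(const int16_t* a, const int16_t* b);

    static int satd_c(const int16_t*, const int16_t*)    { return 0; } // portable reference
    static int satd_sse2(const int16_t*, const int16_t*) { return 0; } // stand-in for asm
    static int satd_avx2(const int16_t*, const int16_t*) { return 0; } // stand-in for asm

    enum { CPU_SSE2 = 1 << 0, CPU_AVX2 = 1 << 1 };

    struct Primitives { satd_t satd; };

    static void setup(Primitives& p, uint32_t cpuMask)
    {
        p.satd = satd_c;            // always-valid fallback installed first
        if (cpuMask & CPU_SSE2)
            p.satd = satd_sse2;     // overwritten when the CPU reports SSE2
        if (cpuMask & CPU_AVX2)
            p.satd = satd_avx2;     // last writer wins: the widest ISA ends up in the table
    }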
@@ -800,6 +800,10 @@ #error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF" #endif +#if X86_64 + p.scanPosLast = x265_scanPosLast_x64; +#endif + if (cpuMask & X265_CPU_SSE2) { /* We do not differentiate CPUs which support MMX and not SSE2. We only check @@ -859,9 +863,6 @@ PIXEL_AVG_W4(mmx2); LUMA_VAR(sse2); - p.luma_p2s = x265_luma_p2s_sse2; - p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_sse2; - p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_sse2; ALL_LUMA_TU(blockfill_s, blockfill_s, sse2); ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2); @@ -872,15 +873,41 @@ ALL_LUMA_TU_S(calcresidual, getResidual, sse2); ALL_LUMA_TU_S(transpose, transpose, sse2); - p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2; - p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2; - p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2; - p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2; - - p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2; - p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2; - p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2; - p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2; + ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2); + ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2); + + p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2; + p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2; + p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_sse2; + p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_sse2; + p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_sse2; + p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2; + p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2; + p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2; + p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2; + p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2; + p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2; + p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2; + p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2; + p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2; + p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2; + p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2; + p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2; + p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2; + p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2; + p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2; + p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2; + p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2; + p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2; + p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2; + p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2; + p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2; + p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2; + p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2; + p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2; + p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2; + p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2; + p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2; p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2; ALL_LUMA_CU(sse_ss, 
pixel_ssd_ss, sse2); @@ -918,6 +945,74 @@ p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3; p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3; p.frameInitLowres = x265_frame_init_lowres_core_ssse3; + + p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_ssse3; + p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_ssse3; + p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_ssse3; + p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3; + p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3; + p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3; + p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3; + p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3; + p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3; + p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3; + p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3; + p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3; + p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3; + p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3; + p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3; + p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3; + p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3; + p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3; + p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3; + p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3; + p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3; + p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3; + p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3; + p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3; + p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3; + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_ssse3; + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3; + p.findPosFirstLast = x265_findPosFirstLast_ssse3; } if (cpuMask & X265_CPU_SSE4) { @@ -957,6 +1052,13 @@ ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4); ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4); ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4; } if (cpuMask & X265_CPU_AVX) { @@ -1079,6 +1181,26 @@ } if (cpuMask & X265_CPU_AVX2) { + p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2; + + p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2; + p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2; + p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2; + p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2; + + p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2; + p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2; + p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2; + p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2; + p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2; + + p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2; + p.pu[LUMA_16x8].satd = x265_pixel_satd_16x8_avx2; + p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2; + p.pu[LUMA_16x16].satd = x265_pixel_satd_16x16_avx2; + 
p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2; + p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2; + p.cu[BLOCK_32x32].ssd_s = x265_pixel_ssd_s_32_avx2; p.cu[BLOCK_16x16].sse_ss = x265_pixel_ssd_ss_16x16_avx2; @@ -1087,6 +1209,7 @@ p.dequant_normal = x265_dequant_normal_avx2; p.scale1D_128to64 = x265_scale1D_128to64_avx2; + p.scale2D_64to32 = x265_scale2D_64to32_avx2; // p.weight_pp = x265_weight_pp_avx2; fails tests p.cu[BLOCK_16x16].calcresidual = x265_getResidual16_avx2; @@ -1119,12 +1242,84 @@ ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, avx2); ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, avx2); ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, avx2); + + p.cu[BLOCK_16x16].add_ps = x265_pixel_add_ps_16x16_avx2; + p.cu[BLOCK_32x32].add_ps = x265_pixel_add_ps_32x32_avx2; + p.cu[BLOCK_64x64].add_ps = x265_pixel_add_ps_64x64_avx2; + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = x265_pixel_add_ps_16x16_avx2; + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = x265_pixel_add_ps_32x32_avx2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = x265_pixel_add_ps_16x32_avx2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = x265_pixel_add_ps_32x64_avx2; + + p.cu[BLOCK_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2; + p.cu[BLOCK_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2; + p.cu[BLOCK_64x64].sub_ps = x265_pixel_sub_ps_64x64_avx2; + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2; + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = x265_pixel_sub_ps_16x32_avx2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = x265_pixel_sub_ps_32x64_avx2; + + p.pu[LUMA_16x4].sad = x265_pixel_sad_16x4_avx2; + p.pu[LUMA_16x8].sad = x265_pixel_sad_16x8_avx2; + p.pu[LUMA_16x12].sad = x265_pixel_sad_16x12_avx2; + p.pu[LUMA_16x16].sad = x265_pixel_sad_16x16_avx2; + p.pu[LUMA_16x32].sad = x265_pixel_sad_16x32_avx2; + + p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_avx2; + p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_avx2; + p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_avx2; + p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_avx2; + p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_avx2; + p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_avx2; + p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2; + p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2; + p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2; + p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2; + p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2; + p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2; + p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2; + p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2; + p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2; + p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2; + p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2; + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = 
x265_filterPixelToShort_16x32_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2; + + p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_avx2; + p.pu[LUMA_4x8].luma_hps = x265_interp_8tap_horiz_ps_4x8_avx2; + p.pu[LUMA_4x16].luma_hps = x265_interp_8tap_horiz_ps_4x16_avx2; + + if (cpuMask & X265_CPU_BMI2) + p.scanPosLast = x265_scanPosLast_avx2_bmi2; } } #else // if HIGH_BIT_DEPTH void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // 8bpp { +#if X86_64 + p.scanPosLast = x265_scanPosLast_x64; +#endif + if (cpuMask & X265_CPU_SSE2) { /* We do not differentiate CPUs which support MMX and not SSE2. 
We only check @@ -1175,6 +1370,47 @@ CHROMA_420_VSP_FILTERS(_sse2); CHROMA_422_VSP_FILTERS(_sse2); CHROMA_444_VSP_FILTERS(_sse2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = x265_interp_4tap_vert_pp_4x2_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vpp = x265_interp_4tap_vert_pp_2x16_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vpp = x265_interp_4tap_vert_pp_4x32_sse2; + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2; + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2; + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2; +#if X86_64 + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vpp = x265_interp_4tap_vert_pp_6x8_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = x265_interp_4tap_vert_pp_8x2_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = x265_interp_4tap_vert_pp_8x6_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vpp = x265_interp_4tap_vert_pp_6x16_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_sse2; + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2; + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2; + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2; + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2; +#endif + + ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, sse2); + p.pu[LUMA_4x4].luma_hpp = x265_interp_8tap_horiz_pp_4x4_sse2; + ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, sse2); + p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_sse2; + p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_sse3; //p.frameInitLowres = 
x265_frame_init_lowres_core_mmx2; p.frameInitLowres = x265_frame_init_lowres_core_sse2; @@ -1186,15 +1422,8 @@ ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2); ALL_LUMA_TU_S(ssd_s, pixel_ssd_s_, sse2); - p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2; - p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2; - p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2; - p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2; - - p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2; - p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2; - p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2; - p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2; + ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2); + ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2); p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2; p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2; @@ -1204,6 +1433,32 @@ p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2; p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2; p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2; + p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2; + p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2; + p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2; + p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2; + p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2; + p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2; + p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2; + p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2; + p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2; + p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2; + p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2; + p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2; + p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2; + p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2; + p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2; + p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2; + p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2; + p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2; + p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2; + p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2; + p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2; + p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2; + p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2; + p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2; + + p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_sse2; p.cu[BLOCK_4x4].calcresidual = x265_getResidual4_sse2; p.cu[BLOCK_8x8].calcresidual = x265_getResidual8_sse2; @@ -1224,6 +1479,12 @@ p.planecopy_sp = x265_downShift_16_sse2; } + if (cpuMask & X265_CPU_SSE3) + { + ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, sse3); + ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, sse3); + ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, sse3); + } if (cpuMask & X265_CPU_SSSE3) { p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_ssse3; @@ -1249,48 +1510,86 @@ ASSIGN_SSE_PP(ssse3); p.cu[BLOCK_4x4].sse_pp = x265_pixel_ssd_4x4_ssse3; p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = x265_pixel_ssd_4x8_ssse3; - p.pu[LUMA_4x4].filter_p2s = 
x265_pixelToShort_4x4_ssse3; - p.pu[LUMA_4x8].filter_p2s = x265_pixelToShort_4x8_ssse3; - p.pu[LUMA_4x16].filter_p2s = x265_pixelToShort_4x16_ssse3; - p.pu[LUMA_8x4].filter_p2s = x265_pixelToShort_8x4_ssse3; - p.pu[LUMA_8x8].filter_p2s = x265_pixelToShort_8x8_ssse3; - p.pu[LUMA_8x16].filter_p2s = x265_pixelToShort_8x16_ssse3; - p.pu[LUMA_8x32].filter_p2s = x265_pixelToShort_8x32_ssse3; - p.pu[LUMA_16x4].filter_p2s = x265_pixelToShort_16x4_ssse3; - p.pu[LUMA_16x8].filter_p2s = x265_pixelToShort_16x8_ssse3; - p.pu[LUMA_16x12].filter_p2s = x265_pixelToShort_16x12_ssse3; - p.pu[LUMA_16x16].filter_p2s = x265_pixelToShort_16x16_ssse3; - p.pu[LUMA_16x32].filter_p2s = x265_pixelToShort_16x32_ssse3; - p.pu[LUMA_16x64].filter_p2s = x265_pixelToShort_16x64_ssse3; - p.pu[LUMA_32x8].filter_p2s = x265_pixelToShort_32x8_ssse3; - p.pu[LUMA_32x16].filter_p2s = x265_pixelToShort_32x16_ssse3; - p.pu[LUMA_32x24].filter_p2s = x265_pixelToShort_32x24_ssse3; - p.pu[LUMA_32x32].filter_p2s = x265_pixelToShort_32x32_ssse3; - p.pu[LUMA_32x64].filter_p2s = x265_pixelToShort_32x64_ssse3; - p.pu[LUMA_64x16].filter_p2s = x265_pixelToShort_64x16_ssse3; - p.pu[LUMA_64x32].filter_p2s = x265_pixelToShort_64x32_ssse3; - p.pu[LUMA_64x48].filter_p2s = x265_pixelToShort_64x48_ssse3; - p.pu[LUMA_64x64].filter_p2s = x265_pixelToShort_64x64_ssse3; - - p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_ssse3; - p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_ssse3; p.dst4x4 = x265_dst4_ssse3; p.cu[BLOCK_8x8].idct = x265_idct8_ssse3; ALL_LUMA_TU(count_nonzero, count_nonzero, ssse3); + // MUST be done after LUMA_FILTERS() to overwrite default version + p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3; + p.frameInitLowres = x265_frame_init_lowres_core_ssse3; p.scale1D_128to64 = x265_scale1D_128to64_ssse3; p.scale2D_64to32 = x265_scale2D_64to32_ssse3; + + p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3; + p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3; + p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3; + p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3; + p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3; + p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3; + p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3; + p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3; + p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3; + p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3; + p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3; + p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3; + p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3; + p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3; + p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3; + p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3; + p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3; + p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3; + p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3; + p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3; + p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3; + p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3; + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3; + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3; + p.findPosFirstLast = x265_findPosFirstLast_ssse3; } if (cpuMask & X265_CPU_SSE4) { p.sign = x265_calSign_sse4; p.saoCuOrgE0 = x265_saoCuOrgE0_sse4; p.saoCuOrgE1 = x265_saoCuOrgE1_sse4; - p.saoCuOrgE2 = x265_saoCuOrgE2_sse4; - p.saoCuOrgE3 = x265_saoCuOrgE3_sse4; + p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_sse4; + p.saoCuOrgE2[0] = x265_saoCuOrgE2_sse4; + p.saoCuOrgE2[1] = x265_saoCuOrgE2_sse4; + p.saoCuOrgE3[0] = x265_saoCuOrgE3_sse4; + p.saoCuOrgE3[1] = x265_saoCuOrgE3_sse4; p.saoCuOrgB0 = x265_saoCuOrgB0_sse4; LUMA_ADDAVG(sse4); @@ -1321,7 +1620,7 @@ CHROMA_444_VSP_FILTERS_SSE4(_sse4); // MUST be done after LUMA_FILTERS() to overwrite default version - p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_sse4; + p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3; LUMA_CU_BLOCKCOPY(ps, sse4); CHROMA_420_CU_BLOCKCOPY(ps, sse4); @@ -1348,6 +1647,25 @@ 
p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_sse4; p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_sse4; + p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_sse4; + p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_sse4; + p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_sse4; + + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_sse4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4; + #if X86_64 ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4); ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4); @@ -1363,6 +1681,20 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = x265_pixel_satd_8x12_avx; p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = x265_pixel_satd_12x32_avx; p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = x265_pixel_satd_4x32_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = x265_pixel_satd_16x32_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = x265_pixel_satd_32x64_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = x265_pixel_satd_16x16_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = x265_pixel_satd_32x32_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = x265_pixel_satd_16x64_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = x265_pixel_satd_16x8_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = x265_pixel_satd_32x16_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = x265_pixel_satd_8x4_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = x265_pixel_satd_8x16_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = x265_pixel_satd_8x8_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = x265_pixel_satd_8x32_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = x265_pixel_satd_4x8_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = x265_pixel_satd_4x16_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = x265_pixel_satd_4x4_avx; ALL_LUMA_PU(satd, pixel_satd, avx); p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = x265_pixel_satd_4x4_avx; p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = x265_pixel_satd_8x8_avx; @@ -1383,6 +1715,10 @@ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = x265_pixel_satd_32x8_avx; p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = x265_pixel_satd_8x32_avx; ASSIGN_SA8D(avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = x265_pixel_sa8d_32x32_avx; + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = x265_pixel_sa8d_16x16_avx; + 
p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = x265_pixel_sa8d_8x8_avx; + p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].sa8d = x265_pixel_satd_4x4_avx; ASSIGN_SSE_PP(avx); p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sse_pp = x265_pixel_ssd_8x8_avx; ASSIGN_SSE_SS(avx); @@ -1405,6 +1741,7 @@ p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx; p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx; p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx; + p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = x265_blockcopy_pp_32x8_avx; p.pu[LUMA_32x8].copy_pp = x265_blockcopy_pp_32x8_avx; @@ -1447,6 +1784,26 @@ #if X86_64 if (cpuMask & X265_CPU_AVX2) { + p.planecopy_sp = x265_downShift_16_avx2; + + p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_avx2; + + p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_avx2; + p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_avx2; + + p.idst4x4 = x265_idst4_avx2; + p.dst4x4 = x265_dst4_avx2; + p.scale2D_64to32 = x265_scale2D_64to32_avx2; + p.saoCuOrgE0 = x265_saoCuOrgE0_avx2; + p.saoCuOrgE1 = x265_saoCuOrgE1_avx2; + p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_avx2; + p.saoCuOrgE2[0] = x265_saoCuOrgE2_avx2; + p.saoCuOrgE2[1] = x265_saoCuOrgE2_32_avx2; + p.saoCuOrgE3[0] = x265_saoCuOrgE3_avx2; + p.saoCuOrgE3[1] = x265_saoCuOrgE3_32_avx2; + p.saoCuOrgB0 = x265_saoCuOrgB0_avx2; + p.sign = x265_calSign_avx2; + p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_avx2; p.cu[BLOCK_8x8].psy_cost_ss = x265_psyCost_ss_8x8_avx2; p.cu[BLOCK_16x16].psy_cost_ss = x265_psyCost_ss_16x16_avx2; @@ -1494,31 +1851,50 @@ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = x265_addAvg_8x8_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = x265_addAvg_8x16_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = x265_addAvg_8x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = x265_addAvg_12x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = x265_addAvg_16x4_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = x265_addAvg_16x8_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = x265_addAvg_16x12_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = x265_addAvg_16x16_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = x265_addAvg_16x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = x265_addAvg_32x8_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = x265_addAvg_32x16_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = x265_addAvg_32x24_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = x265_addAvg_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = x265_addAvg_8x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = x265_addAvg_8x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = x265_addAvg_8x12_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = x265_addAvg_8x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = x265_addAvg_8x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = x265_addAvg_8x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = x265_addAvg_12x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = x265_addAvg_16x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = x265_addAvg_16x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = 
x265_addAvg_16x24_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = x265_addAvg_16x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = x265_addAvg_16x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = x265_addAvg_24x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = x265_addAvg_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = x265_addAvg_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = x265_addAvg_32x48_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = x265_addAvg_32x64_avx2; + p.cu[BLOCK_16x16].add_ps = x265_pixel_add_ps_16x16_avx2; p.cu[BLOCK_32x32].add_ps = x265_pixel_add_ps_32x32_avx2; p.cu[BLOCK_64x64].add_ps = x265_pixel_add_ps_64x64_avx2; p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = x265_pixel_add_ps_16x16_avx2; p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = x265_pixel_add_ps_32x32_avx2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = x265_pixel_add_ps_16x32_avx2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = x265_pixel_add_ps_32x64_avx2; p.cu[BLOCK_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2; p.cu[BLOCK_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2; p.cu[BLOCK_64x64].sub_ps = x265_pixel_sub_ps_64x64_avx2; p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2; p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = x265_pixel_sub_ps_16x32_avx2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = x265_pixel_sub_ps_32x64_avx2; p.pu[LUMA_16x4].pixelavg_pp = x265_pixel_avg_16x4_avx2; p.pu[LUMA_16x8].pixelavg_pp = x265_pixel_avg_16x8_avx2; @@ -1543,6 +1919,22 @@ p.pu[LUMA_8x16].satd = x265_pixel_satd_8x16_avx2; p.pu[LUMA_8x8].satd = x265_pixel_satd_8x8_avx2; + p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2; + p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2; + p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2; + p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2; + + p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2; + p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2; + p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2; + p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2; + p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2; + p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2; + p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2; + p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2; + p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2; + p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2; + p.pu[LUMA_32x8].sad = x265_pixel_sad_32x8_avx2; p.pu[LUMA_32x16].sad = x265_pixel_sad_32x16_avx2; p.pu[LUMA_32x24].sad = x265_pixel_sad_32x24_avx2; @@ -1602,8 +1994,37 @@ p.scale1D_128to64 = x265_scale1D_128to64_avx2; p.weight_pp = x265_weight_pp_avx2; + p.weight_sp = x265_weight_sp_avx2; // intra_pred functions + p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_avx2; + p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_avx2; + p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_avx2; + p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_avx2; + p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_avx2; + p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_avx2; + p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_avx2; + p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_avx2; + p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_avx2; + p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_avx2; + 
p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_avx2; + p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_avx2; + p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_avx2; + p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_avx2; + p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_19_avx2; + p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_20_avx2; + p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_21_avx2; + p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_22_avx2; + p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_23_avx2; + p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_24_avx2; + p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_25_avx2; + p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_27_avx2; + p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_28_avx2; + p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_29_avx2; + p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_30_avx2; + p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_31_avx2; + p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_32_avx2; + p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_33_avx2; p.cu[BLOCK_8x8].intra_pred[3] = x265_intra_pred_ang8_3_avx2; p.cu[BLOCK_8x8].intra_pred[33] = x265_intra_pred_ang8_33_avx2; p.cu[BLOCK_8x8].intra_pred[4] = x265_intra_pred_ang8_4_avx2; @@ -1622,6 +2043,24 @@ p.cu[BLOCK_8x8].intra_pred[12] = x265_intra_pred_ang8_12_avx2; p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2; p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2; + p.cu[BLOCK_8x8].intra_pred[13] = x265_intra_pred_ang8_13_avx2; + p.cu[BLOCK_8x8].intra_pred[20] = x265_intra_pred_ang8_20_avx2; + p.cu[BLOCK_8x8].intra_pred[21] = x265_intra_pred_ang8_21_avx2; + p.cu[BLOCK_8x8].intra_pred[22] = x265_intra_pred_ang8_22_avx2; + p.cu[BLOCK_8x8].intra_pred[23] = x265_intra_pred_ang8_23_avx2; + p.cu[BLOCK_8x8].intra_pred[14] = x265_intra_pred_ang8_14_avx2; + p.cu[BLOCK_8x8].intra_pred[15] = x265_intra_pred_ang8_15_avx2; + p.cu[BLOCK_8x8].intra_pred[16] = x265_intra_pred_ang8_16_avx2; + p.cu[BLOCK_16x16].intra_pred[3] = x265_intra_pred_ang16_3_avx2; + p.cu[BLOCK_16x16].intra_pred[4] = x265_intra_pred_ang16_4_avx2; + p.cu[BLOCK_16x16].intra_pred[5] = x265_intra_pred_ang16_5_avx2; + p.cu[BLOCK_16x16].intra_pred[6] = x265_intra_pred_ang16_6_avx2; + p.cu[BLOCK_16x16].intra_pred[7] = x265_intra_pred_ang16_7_avx2; + p.cu[BLOCK_16x16].intra_pred[8] = x265_intra_pred_ang16_8_avx2; + p.cu[BLOCK_16x16].intra_pred[9] = x265_intra_pred_ang16_9_avx2; + p.cu[BLOCK_16x16].intra_pred[12] = x265_intra_pred_ang16_12_avx2; + p.cu[BLOCK_16x16].intra_pred[11] = x265_intra_pred_ang16_11_avx2; + p.cu[BLOCK_16x16].intra_pred[13] = x265_intra_pred_ang16_13_avx2; p.cu[BLOCK_16x16].intra_pred[25] = x265_intra_pred_ang16_25_avx2; p.cu[BLOCK_16x16].intra_pred[28] = x265_intra_pred_ang16_28_avx2; p.cu[BLOCK_16x16].intra_pred[27] = x265_intra_pred_ang16_27_avx2; @@ -1642,6 +2081,16 @@ p.cu[BLOCK_32x32].intra_pred[30] = x265_intra_pred_ang32_30_avx2; p.cu[BLOCK_32x32].intra_pred[31] = x265_intra_pred_ang32_31_avx2; p.cu[BLOCK_32x32].intra_pred[32] = x265_intra_pred_ang32_32_avx2; + p.cu[BLOCK_32x32].intra_pred[33] = x265_intra_pred_ang32_33_avx2; + p.cu[BLOCK_32x32].intra_pred[25] = x265_intra_pred_ang32_25_avx2; + p.cu[BLOCK_32x32].intra_pred[24] = x265_intra_pred_ang32_24_avx2; + p.cu[BLOCK_32x32].intra_pred[23] = x265_intra_pred_ang32_23_avx2; + p.cu[BLOCK_32x32].intra_pred[22] = x265_intra_pred_ang32_22_avx2; + p.cu[BLOCK_32x32].intra_pred[21] = 
x265_intra_pred_ang32_21_avx2; + p.cu[BLOCK_32x32].intra_pred[18] = x265_intra_pred_ang32_18_avx2; + + // all_angs primitives + p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_avx2; // copy_sp primitives p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2; @@ -1725,6 +2174,8 @@ p.pu[LUMA_64x48].luma_hps = x265_interp_8tap_horiz_ps_64x48_avx2; p.pu[LUMA_64x32].luma_hps = x265_interp_8tap_horiz_ps_64x32_avx2; p.pu[LUMA_64x16].luma_hps = x265_interp_8tap_horiz_ps_64x16_avx2; + p.pu[LUMA_12x16].luma_hps = x265_interp_8tap_horiz_ps_12x16_avx2; + p.pu[LUMA_24x32].luma_hps = x265_interp_8tap_horiz_ps_24x32_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2; @@ -1744,6 +2195,7 @@ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hpp = x265_interp_4tap_horiz_pp_6x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_hpp = x265_interp_4tap_horiz_pp_6x16_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2; @@ -1777,6 +2229,7 @@ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = x265_interp_4tap_horiz_ps_16x4_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hps = x265_interp_4tap_horiz_ps_24x32_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = x265_interp_4tap_horiz_ps_32x24_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = x265_interp_4tap_horiz_ps_32x8_avx2; @@ -1887,8 +2340,353 @@ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2; p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vss = x265_interp_4tap_vert_ss_32x24_avx2; - if ((cpuMask & X265_CPU_BMI1) && (cpuMask & X265_CPU_BMI2)) - p.findPosLast = x265_findPosLast_x64; + //i422 for chroma_vss + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vss = x265_interp_4tap_vert_ss_2x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vss = x265_interp_4tap_vert_ss_8x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vss = x265_interp_4tap_vert_ss_32x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vss = 
x265_interp_4tap_vert_ss_16x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vss = x265_interp_4tap_vert_ss_24x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vss = x265_interp_4tap_vert_ss_8x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vss = x265_interp_4tap_vert_ss_32x48_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vss = x265_interp_4tap_vert_ss_8x12_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vss = x265_interp_4tap_vert_ss_6x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vss = x265_interp_4tap_vert_ss_2x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vss = x265_interp_4tap_vert_ss_16x24_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vss = x265_interp_4tap_vert_ss_12x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vss = x265_interp_4tap_vert_ss_4x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vss = x265_interp_4tap_vert_ss_2x4_avx2; + + //i444 for chroma_vss + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vss = x265_interp_4tap_vert_ss_64x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss = x265_interp_4tap_vert_ss_16x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = x265_interp_4tap_vert_ss_16x12_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vss = x265_interp_4tap_vert_ss_12x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss = x265_interp_4tap_vert_ss_16x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = x265_interp_4tap_vert_ss_32x24_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vss = x265_interp_4tap_vert_ss_24x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss = x265_interp_4tap_vert_ss_32x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vss = x265_interp_4tap_vert_ss_8x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vss = x265_interp_4tap_vert_ss_64x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = x265_interp_4tap_vert_ss_32x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vss = x265_interp_4tap_vert_ss_64x48_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vss = x265_interp_4tap_vert_ss_48x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vss = x265_interp_4tap_vert_ss_64x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = x265_interp_4tap_vert_ss_16x64_avx2; + + p.pu[LUMA_16x16].luma_hvpp = x265_interp_8tap_hv_pp_16x16_avx2; + + p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2; + p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2; + p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2; + 
p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2; + p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2; + p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2; + p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2; + p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2; + p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2; + p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2; + p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2; + + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2; + + //i422 for chroma_hpp + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hpp = x265_interp_4tap_horiz_pp_12x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hpp = x265_interp_4tap_horiz_pp_24x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = x265_interp_4tap_horiz_pp_2x16_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = x265_interp_4tap_horiz_pp_2x16_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hpp = x265_interp_4tap_horiz_pp_8x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hpp = x265_interp_4tap_horiz_pp_8x12_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = x265_interp_4tap_horiz_pp_16x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = x265_interp_4tap_horiz_pp_16x24_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = x265_interp_4tap_horiz_pp_32x64_avx2; + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = x265_interp_4tap_horiz_pp_32x48_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hpp = x265_interp_4tap_horiz_pp_2x8_avx2; + + //i444 filters hpp + + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = x265_interp_4tap_horiz_pp_16x12_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = x265_interp_4tap_horiz_pp_16x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = x265_interp_4tap_horiz_pp_16x64_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hpp = x265_interp_4tap_horiz_pp_12x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hpp = x265_interp_4tap_horiz_pp_24x32_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = x265_interp_4tap_horiz_pp_32x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = x265_interp_4tap_horiz_pp_32x8_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = x265_interp_4tap_horiz_pp_64x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = x265_interp_4tap_horiz_pp_64x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = x265_interp_4tap_horiz_pp_64x48_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = x265_interp_4tap_horiz_pp_64x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = x265_interp_4tap_horiz_pp_48x64_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hps = x265_interp_4tap_horiz_ps_8x64_avx2; //adding macro call + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hps = x265_interp_4tap_horiz_ps_8x12_avx2; //adding macro call + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2; + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = x265_interp_4tap_horiz_ps_16x64_avx2;//adding macro call + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = x265_interp_4tap_horiz_ps_16x24_avx2;//adding macro call + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = x265_interp_4tap_horiz_ps_32x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = x265_interp_4tap_horiz_ps_32x48_avx2; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hps = x265_interp_4tap_horiz_ps_2x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hps = x265_interp_4tap_horiz_ps_24x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hps = x265_interp_4tap_horiz_ps_2x16_avx2; + + //i444 chroma_hps + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = x265_interp_4tap_horiz_ps_64x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = x265_interp_4tap_horiz_ps_64x48_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = x265_interp_4tap_horiz_ps_64x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = x265_interp_4tap_horiz_ps_64x64_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = x265_interp_4tap_horiz_ps_16x12_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = x265_interp_4tap_horiz_ps_16x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = x265_interp_4tap_horiz_ps_16x64_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hps = x265_interp_4tap_horiz_ps_24x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = x265_interp_4tap_horiz_ps_48x64_avx2; + + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = x265_interp_4tap_horiz_ps_32x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = x265_interp_4tap_horiz_ps_32x24_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = x265_interp_4tap_horiz_ps_32x8_avx2; + + //i422 for chroma_vsp + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2; + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vsp = x265_interp_4tap_vert_sp_2x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = x265_interp_4tap_vert_sp_32x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = x265_interp_4tap_vert_sp_16x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vsp = x265_interp_4tap_vert_sp_24x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vsp = x265_interp_4tap_vert_sp_8x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = x265_interp_4tap_vert_sp_32x48_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vsp = x265_interp_4tap_vert_sp_8x12_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vsp = x265_interp_4tap_vert_sp_6x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vsp = x265_interp_4tap_vert_sp_2x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = x265_interp_4tap_vert_sp_16x24_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vsp = x265_interp_4tap_vert_sp_12x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vsp = x265_interp_4tap_vert_sp_4x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vsp = x265_interp_4tap_vert_sp_2x4_avx2; + + //i444 for chroma_vsp + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = x265_interp_4tap_vert_sp_64x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = x265_interp_4tap_vert_sp_16x12_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vsp = x265_interp_4tap_vert_sp_12x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = x265_interp_4tap_vert_sp_16x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vsp = 
x265_interp_4tap_vert_sp_4x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vsp = x265_interp_4tap_vert_sp_32x24_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vsp = x265_interp_4tap_vert_sp_24x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = x265_interp_4tap_vert_sp_32x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = x265_interp_4tap_vert_sp_64x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = x265_interp_4tap_vert_sp_32x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = x265_interp_4tap_vert_sp_64x48_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = x265_interp_4tap_vert_sp_48x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = x265_interp_4tap_vert_sp_64x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = x265_interp_4tap_vert_sp_16x64_avx2; + + //i422 for chroma_vps + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vps = x265_interp_4tap_vert_ps_2x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vps = x265_interp_4tap_vert_ps_8x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = x265_interp_4tap_vert_ps_8x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = x265_interp_4tap_vert_ps_32x48_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vps = x265_interp_4tap_vert_ps_12x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vps = x265_interp_4tap_vert_ps_8x12_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vps = x265_interp_4tap_vert_ps_2x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = x265_interp_4tap_vert_ps_16x24_avx2; + + //i444 for chroma_vps + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2; + 
p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = x265_interp_4tap_vert_ps_16x12_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vps = x265_interp_4tap_vert_ps_12x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = x265_interp_4tap_vert_ps_16x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vps = x265_interp_4tap_vert_ps_32x24_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vps = x265_interp_4tap_vert_ps_24x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = x265_interp_4tap_vert_ps_32x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vps = x265_interp_4tap_vert_ps_8x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2; + + //i422 for chroma_vpp + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = x265_interp_4tap_vert_pp_16x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = x265_interp_4tap_vert_pp_32x48_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vpp = x265_interp_4tap_vert_pp_12x32_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = x265_interp_4tap_vert_pp_16x24_avx2; + + //i444 for chroma_vpp + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = 
x265_interp_4tap_vert_pp_16x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = x265_interp_4tap_vert_pp_16x12_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vpp = x265_interp_4tap_vert_pp_12x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = x265_interp_4tap_vert_pp_16x4_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vpp = x265_interp_4tap_vert_pp_32x24_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vpp = x265_interp_4tap_vert_pp_24x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = x265_interp_4tap_vert_pp_32x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2; + + if (cpuMask & X265_CPU_BMI2) + p.scanPosLast = x265_scanPosLast_avx2_bmi2; } #endif }
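Everything in the hunk above follows a single dispatch pattern: x265 keeps a table of function pointers (the p object), fills it with portable C primitives first, and then, guarded by checks such as if (cpuMask & X265_CPU_AVX2), overwrites each entry with the fastest assembly version the host CPU reports, so later ISA levels win over earlier ones. A minimal self-contained C++ sketch of that idea follows; the names here are illustrative stand-ins, not x265's actual API:

    #include <cstdint>

    // Hypothetical stand-ins for x265 primitive-table entries.
    typedef int (*satd_t)(const uint16_t* a, const uint16_t* b);

    static int satd_c(const uint16_t*, const uint16_t*)     { return 0; } // portable fallback
    static int satd_ssse3(const uint16_t*, const uint16_t*) { return 1; } // pretend SIMD versions
    static int satd_avx2(const uint16_t*, const uint16_t*)  { return 2; }

    enum : uint32_t { CPU_SSSE3 = 1u << 0, CPU_SSE4 = 1u << 1, CPU_AVX2 = 1u << 2 };

    struct Primitives { satd_t satd; };

    void setupPrimitives(Primitives& p, uint32_t cpuMask)
    {
        p.satd = satd_c;                              // baseline, always valid
        if (cpuMask & CPU_SSSE3) p.satd = satd_ssse3; // each supported ISA level
        if (cpuMask & CPU_AVX2)  p.satd = satd_avx2;  // overwrites the previous one
    }

Hot loops then call p.satd(...) with no per-call CPU checks; the dispatch cost is paid once at initialization, which is why the hunk above consists of nothing but table assignments.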
x265_1.6.tar.gz/source/common/x86/const-a.asm -> x265_1.7.tar.gz/source/common/x86/const-a.asm
Changed
@@ -29,81 +29,100 @@ SECTION_RODATA 32 -const pb_1, times 32 db 1 +;; 8-bit constants -const hsub_mul, times 16 db 1, -1 -const pw_1, times 16 dw 1 -const pw_16, times 16 dw 16 -const pw_32, times 16 dw 32 -const pw_128, times 16 dw 128 -const pw_256, times 16 dw 256 -const pw_257, times 16 dw 257 -const pw_512, times 16 dw 512 -const pw_1023, times 8 dw 1023 -ALIGN 32 -const pw_1024, times 16 dw 1024 -const pw_4096, times 16 dw 4096 -const pw_00ff, times 16 dw 0x00ff -ALIGN 32 -const pw_pixel_max,times 16 dw ((1 << BIT_DEPTH)-1) -const deinterleave_shufd, dd 0,4,1,5,2,6,3,7 -const pb_unpackbd1, times 2 db 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 -const pb_unpackbd2, times 2 db 4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7 -const pb_unpackwq1, db 0,1,0,1,0,1,0,1,2,3,2,3,2,3,2,3 -const pb_unpackwq2, db 4,5,4,5,4,5,4,5,6,7,6,7,6,7,6,7 -const pw_swap, times 2 db 6,7,4,5,2,3,0,1 +const pb_0, times 16 db 0 +const pb_1, times 32 db 1 +const pb_2, times 32 db 2 +const pb_3, times 16 db 3 +const pb_4, times 32 db 4 +const pb_8, times 32 db 8 +const pb_15, times 32 db 15 +const pb_16, times 32 db 16 +const pb_32, times 32 db 32 +const pb_64, times 32 db 64 +const pb_128, times 16 db 128 +const pb_a1, times 16 db 0xa1 -const pb_2, times 32 db 2 -const pb_4, times 32 db 4 -const pb_16, times 32 db 16 -const pb_64, times 32 db 64 -const pb_01, times 8 db 0,1 -const pb_0, times 16 db 0 -const pb_a1, times 16 db 0xa1 -const pb_3, times 16 db 3 -const pb_8, times 32 db 8 -const pb_32, times 32 db 32 -const pb_128, times 16 db 128 -const pb_shuf8x8c, db 0,0,0,0,2,2,2,2,4,4,4,4,6,6,6,6 +const pb_01, times 8 db 0, 1 +const hsub_mul, times 16 db 1, -1 +const pw_swap, times 2 db 6, 7, 4, 5, 2, 3, 0, 1 +const pb_unpackbd1, times 2 db 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3 +const pb_unpackbd2, times 2 db 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7 +const pb_unpackwq1, times 1 db 0, 1, 0, 1, 0, 1, 0, 1, 2, 3, 2, 3, 2, 3, 2, 3 +const pb_unpackwq2, times 1 db 4, 5, 4, 5, 4, 5, 4, 5, 6, 7, 6, 7, 6, 7, 6, 7 +const pb_shuf8x8c, times 1 db 0, 0, 0, 0, 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6 +const pb_movemask, times 16 db 0x00 + times 16 db 0xFF +const pb_0000000000000F0F, times 2 db 0xff, 0x00 + times 12 db 0x00 +const pb_000000000000000F, db 0xff + times 15 db 0x00 -const pw_0_15, times 2 dw 0, 1, 2, 3, 4, 5, 6, 7 -const pw_2, times 8 dw 2 -const pw_m2, times 8 dw -2 -const pw_4, times 8 dw 4 -const pw_8, times 8 dw 8 -const pw_64, times 8 dw 64 -const pw_256, times 8 dw 256 -const pw_32_0, times 4 dw 32, - times 4 dw 0 -const pw_2000, times 16 dw 0x2000 -const pw_8000, times 8 dw 0x8000 -const pw_3fff, times 8 dw 0x3fff -const pw_ppppmmmm, dw 1,1,1,1,-1,-1,-1,-1 -const pw_ppmmppmm, dw 1,1,-1,-1,1,1,-1,-1 -const pw_pmpmpmpm, dw 1,-1,1,-1,1,-1,1,-1 -const pw_pmmpzzzz, dw 1,-1,-1,1,0,0,0,0 -const pd_1, times 8 dd 1 -const pd_2, times 8 dd 2 -const pd_4, times 4 dd 4 -const pd_8, times 4 dd 8 -const pd_16, times 4 dd 16 -const pd_32, times 4 dd 32 -const pd_64, times 4 dd 64 -const pd_128, times 4 dd 128 -const pd_256, times 4 dd 256 -const pd_512, times 4 dd 512 -const pd_1024, times 4 dd 1024 -const pd_2048, times 4 dd 2048 -const pd_ffff, times 4 dd 0xffff -const pd_32767, times 4 dd 32767 -const pd_n32768, times 4 dd 0xffff8000 -const pw_ff00, times 8 dw 0xff00 +;; 16-bit constants -const multi_2Row, dw 1, 2, 3, 4, 1, 2, 3, 4 -const multiL, dw 1, 2, 3, 4, 5, 6, 7, 8 -const multiH, dw 9, 10, 11, 12, 13, 14, 15, 16 -const multiH2, dw 17, 18, 19, 20, 21, 22, 23, 24 -const multiH3, dw 25, 26, 27, 28, 29, 30, 31, 32 +const pw_1, times 
16 dw 1 +const pw_2, times 8 dw 2 +const pw_m2, times 8 dw -2 +const pw_4, times 8 dw 4 +const pw_8, times 8 dw 8 +const pw_16, times 16 dw 16 +const pw_15, times 16 dw 15 +const pw_31, times 16 dw 31 +const pw_32, times 16 dw 32 +const pw_64, times 8 dw 64 +const pw_128, times 16 dw 128 +const pw_256, times 16 dw 256 +const pw_257, times 16 dw 257 +const pw_512, times 16 dw 512 +const pw_1023, times 8 dw 1023 +const pw_1024, times 16 dw 1024 +const pw_4096, times 16 dw 4096 +const pw_00ff, times 16 dw 0x00ff +const pw_ff00, times 8 dw 0xff00 +const pw_2000, times 16 dw 0x2000 +const pw_8000, times 8 dw 0x8000 +const pw_3fff, times 8 dw 0x3fff +const pw_32_0, times 4 dw 32, + times 4 dw 0 +const pw_pixel_max, times 16 dw ((1 << BIT_DEPTH)-1) + +const pw_0_15, times 2 dw 0, 1, 2, 3, 4, 5, 6, 7 +const pw_ppppmmmm, times 1 dw 1, 1, 1, 1, -1, -1, -1, -1 +const pw_ppmmppmm, times 1 dw 1, 1, -1, -1, 1, 1, -1, -1 +const pw_pmpmpmpm, times 1 dw 1, -1, 1, -1, 1, -1, 1, -1 +const pw_pmmpzzzz, times 1 dw 1, -1, -1, 1, 0, 0, 0, 0 +const multi_2Row, times 1 dw 1, 2, 3, 4, 1, 2, 3, 4 +const multiH, times 1 dw 9, 10, 11, 12, 13, 14, 15, 16 +const multiH3, times 1 dw 25, 26, 27, 28, 29, 30, 31, 32 +const multiL, times 1 dw 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 +const multiH2, times 1 dw 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 +const pw_planar16_mul, times 1 dw 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 +const pw_planar32_mul, times 1 dw 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16 +const pw_FFFFFFFFFFFFFFF0, dw 0x00 + times 7 dw 0xff + + +;; 32-bit constants + +const pd_1, times 8 dd 1 +const pd_2, times 8 dd 2 +const pd_4, times 4 dd 4 +const pd_8, times 4 dd 8 +const pd_16, times 4 dd 16 +const pd_32, times 4 dd 32 +const pd_64, times 4 dd 64 +const pd_128, times 4 dd 128 +const pd_256, times 4 dd 256 +const pd_512, times 4 dd 512 +const pd_1024, times 4 dd 1024 +const pd_2048, times 4 dd 2048 +const pd_ffff, times 4 dd 0xffff +const pd_32767, times 4 dd 32767 +const pd_n32768, times 4 dd 0xffff8000 + +const trans8_shuf, times 1 dd 0, 4, 1, 5, 2, 6, 3, 7 +const deinterleave_shufd, times 1 dd 0, 4, 1, 5, 2, 6, 3, 7 const popcnt_table %assign x 0
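The const-a.asm change above only regroups the NASM constant tables by element width (8-, 16-, and 32-bit) and widens a few of them; the values themselves are unchanged. Each "times N dw V" directive materializes an aligned vector constant. As a point of reference, two of the tables above would look roughly like this in C++ (sizes assume the widest, 32-byte AVX2 use; names mirror the asm labels but are illustrative only):

    #include <cstdint>

    // Rough C++ equivalents of two NASM constants from the hunk above.
    alignas(32) static const uint16_t pw_32[16] =           // const pw_32, times 16 dw 32
        { 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32 };
    alignas(32) static const uint16_t pw_planar16_mul[16] = // dw 15, 14, ..., 1, 0
        { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };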
x265_1.6.tar.gz/source/common/x86/dct8.asm -> x265_1.7.tar.gz/source/common/x86/dct8.asm
Changed
@@ -261,6 +261,11 @@ times 2 dw 84, -29, -74, 55 times 2 dw 55, -84, 74, -29 +pw_dst4_tab: times 4 dw 29, 55, 74, 84 + times 4 dw 74, 74, 0, -74 + times 4 dw 84, -29, -74, 55 + times 4 dw 55, -84, 74, -29 + tab_idst4: times 4 dw 29, +84 times 4 dw +74, +55 times 4 dw 55, -29 @@ -270,6 +275,16 @@ times 4 dw 84, +55 times 4 dw -74, -29 +pw_idst4_tab: times 4 dw 29, 84 + times 4 dw 55, -29 + times 4 dw 74, 55 + times 4 dw 74, -84 + times 4 dw 74, -74 + times 4 dw 84, 55 + times 4 dw 0, 74 + times 4 dw -74, -29 +pb_idst4_shuf: times 2 db 0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15 + tab_dct8_1: times 2 dw 89, 50, 75, 18 times 2 dw 75, -89, -18, -50 times 2 dw 50, 18, -89, 75 @@ -316,7 +331,7 @@ cextern pd_1024 cextern pd_2048 cextern pw_ppppmmmm - +cextern trans8_shuf ;------------------------------------------------------ ;void dct4(const int16_t* src, int16_t* dst, intptr_t srcStride) ;------------------------------------------------------ @@ -656,6 +671,59 @@ RET +;------------------------------------------------------------------ +;void dst4(const int16_t* src, int16_t* dst, intptr_t srcStride) +;------------------------------------------------------------------ +INIT_YMM avx2 +cglobal dst4, 3, 4, 6 +%if BIT_DEPTH == 8 + %define DST_SHIFT 1 + vpbroadcastd m5, [pd_1] +%elif BIT_DEPTH == 10 + %define DST_SHIFT 3 + vpbroadcastd m5, [pd_4] +%endif + mova m4, [trans8_shuf] + add r2d, r2d + lea r3, [pw_dst4_tab] + + movq xm0, [r0 + 0 * r2] + movhps xm0, [r0 + 1 * r2] + lea r0, [r0 + 2 * r2] + movq xm1, [r0] + movhps xm1, [r0 + r2] + + vinserti128 m0, m0, xm1, 1 ; m0 = src[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] + + pmaddwd m2, m0, [r3 + 0 * 32] + pmaddwd m1, m0, [r3 + 1 * 32] + phaddd m2, m1 + paddd m2, m5 + psrad m2, DST_SHIFT + pmaddwd m3, m0, [r3 + 2 * 32] + pmaddwd m1, m0, [r3 + 3 * 32] + phaddd m3, m1 + paddd m3, m5 + psrad m3, DST_SHIFT + packssdw m2, m3 + vpermd m2, m4, m2 + + vpbroadcastd m5, [pd_128] + pmaddwd m0, m2, [r3 + 0 * 32] + pmaddwd m1, m2, [r3 + 1 * 32] + phaddd m0, m1 + paddd m0, m5 + psrad m0, 8 + pmaddwd m3, m2, [r3 + 2 * 32] + pmaddwd m2, m2, [r3 + 3 * 32] + phaddd m3, m2 + paddd m3, m5 + psrad m3, 8 + packssdw m0, m3 + vpermd m0, m4, m0 + movu [r1], m0 + RET + ;------------------------------------------------------- ;void idst4(const int16_t* src, int16_t* dst, intptr_t dstStride) ;------------------------------------------------------- @@ -748,6 +816,81 @@ movhps [r1 + r2], m1 RET +;----------------------------------------------------------------- +;void idst4(const int16_t* src, int16_t* dst, intptr_t dstStride) +;----------------------------------------------------------------- +INIT_YMM avx2 +cglobal idst4, 3, 4, 6 +%if BIT_DEPTH == 8 + vpbroadcastd m4, [pd_2048] + %define IDCT4_SHIFT 12 +%elif BIT_DEPTH == 10 + vpbroadcastd m4, [pd_512] + %define IDCT4_SHIFT 10 +%else + %error Unsupported BIT_DEPTH! 
+%endif + add r2d, r2d + lea r3, [pw_idst4_tab] + + movu xm0, [r0 + 0 * 16] + movu xm1, [r0 + 1 * 16] + + punpcklwd m2, m0, m1 + punpckhwd m0, m1 + + vinserti128 m2, m2, xm2, 1 + vinserti128 m0, m0, xm0, 1 + + vpbroadcastd m5, [pd_64] + pmaddwd m1, m2, [r3 + 0 * 32] + pmaddwd m3, m0, [r3 + 1 * 32] + paddd m1, m3 + paddd m1, m5 + psrad m1, 7 + pmaddwd m3, m2, [r3 + 2 * 32] + pmaddwd m0, [r3 + 3 * 32] + paddd m3, m0 + paddd m3, m5 + psrad m3, 7 + + packssdw m0, m1, m3 + pshufb m0, [pb_idst4_shuf] + vpermq m1, m0, 11101110b + + punpcklwd m2, m0, m1 + punpckhwd m0, m1 + punpcklwd m1, m2, m0 + punpckhwd m2, m0 + + vpermq m1, m1, 01000100b + vpermq m2, m2, 01000100b + + pmaddwd m0, m1, [r3 + 0 * 32] + pmaddwd m3, m2, [r3 + 1 * 32] + paddd m0, m3 + paddd m0, m4 + psrad m0, IDCT4_SHIFT + pmaddwd m3, m1, [r3 + 2 * 32] + pmaddwd m2, m2, [r3 + 3 * 32] + paddd m3, m2 + paddd m3, m4 + psrad m3, IDCT4_SHIFT + + packssdw m0, m3 + pshufb m1, m0, [pb_idst4_shuf] + vpermq m0, m1, 11101110b + + punpcklwd m2, m1, m0 + movq [r1 + 0 * r2], xm2 + movhps [r1 + 1 * r2], xm2 + + punpckhwd m1, m0 + movq [r1 + 2 * r2], xm1 + lea r1, [r1 + 2 * r2] + movhps [r1 + r2], xm1 + RET + ;------------------------------------------------------- ; void dct8(const int16_t* src, int16_t* dst, intptr_t srcStride) ;-------------------------------------------------------
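The new AVX2 dst4 and idst4 above vectorize the 4x4 DST-VII that HEVC applies to 4x4 intra luma residuals: multiply by the coefficient rows stored in pw_dst4_tab, then round and shift once per pass (add 1, shift 1 in the first pass for 8-bit input; add 128, shift 8 in the second, matching the pd_1 and pd_128 broadcasts in the asm). A scalar sketch of the forward transform it implements, assuming 8-bit depth and a source stride of 4 for brevity:

    #include <cstdint>

    // Coefficient rows, matching pw_dst4_tab in the hunk above.
    static const int T[4][4] = {
        { 29,  55,  74,  84 },
        { 74,  74,   0, -74 },
        { 84, -29, -74,  55 },
        { 55, -84,  74, -29 },
    };

    // Scalar dst4 sketch: dst = T * src * T^T with per-pass rounding (8-bit case).
    void dst4_ref(const int16_t src[16], int16_t dst[16])
    {
        int32_t tmp[16];
        for (int i = 0; i < 4; i++)        // pass 1: transform rows, store transposed
            for (int k = 0; k < 4; k++)
            {
                int32_t s = 0;
                for (int j = 0; j < 4; j++)
                    s += T[k][j] * src[i * 4 + j];
                tmp[k * 4 + i] = (s + 1) >> 1;       // DST_SHIFT = 1 at 8-bit depth
            }
        for (int i = 0; i < 4; i++)        // pass 2: same transform on the intermediate
            for (int k = 0; k < 4; k++)
            {
                int32_t s = 0;
                for (int j = 0; j < 4; j++)
                    s += T[k][j] * tmp[i * 4 + j];
                dst[k * 4 + i] = (int16_t)((s + 128) >> 8);
            }
    }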
x265_1.6.tar.gz/source/common/x86/dct8.h -> x265_1.7.tar.gz/source/common/x86/dct8.h
Changed
@@ -26,6 +26,7 @@ void x265_dct4_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride); void x265_dct8_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride); void x265_dst4_ssse3(const int16_t* src, int16_t* dst, intptr_t srcStride); +void x265_dst4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride); void x265_dct8_sse4(const int16_t* src, int16_t* dst, intptr_t srcStride); void x265_dct4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride); void x265_dct8_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride); @@ -33,6 +34,7 @@ void x265_dct32_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride); void x265_idst4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride); +void x265_idst4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride); void x265_idct4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride); void x265_idct4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride); void x265_idct8_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
x265_1.6.tar.gz/source/common/x86/intrapred.h -> x265_1.7.tar.gz/source/common/x86/intrapred.h
Changed
@@ -34,6 +34,7 @@ void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter); void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter); void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter); +void x265_intra_pred_dc32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter); void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int); void x265_intra_pred_planar8_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int); @@ -43,6 +44,8 @@ void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int); void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int); void x265_intra_pred_planar32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int); +void x265_intra_pred_planar16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int); +void x265_intra_pred_planar32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int); #define DECL_ANG(bsize, mode, cpu) \ void x265_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); @@ -55,6 +58,16 @@ DECL_ANG(4, 7, sse2); DECL_ANG(4, 8, sse2); DECL_ANG(4, 9, sse2); +DECL_ANG(4, 10, sse2); +DECL_ANG(4, 11, sse2); +DECL_ANG(4, 12, sse2); +DECL_ANG(4, 13, sse2); +DECL_ANG(4, 14, sse2); +DECL_ANG(4, 15, sse2); +DECL_ANG(4, 16, sse2); +DECL_ANG(4, 17, sse2); +DECL_ANG(4, 18, sse2); +DECL_ANG(4, 26, sse2); DECL_ANG(4, 2, ssse3); DECL_ANG(4, 3, sse4); @@ -174,6 +187,34 @@ DECL_ANG(32, 33, sse4); #undef DECL_ANG +void x265_intra_pred_ang4_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_17_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_19_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int 
dirMode, int bFilter); +void x265_intra_pred_ang4_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang4_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); @@ -192,6 +233,24 @@ void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_7_avx2(pixel* dst, intptr_t dstStride, 
const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); @@ -212,8 +271,17 @@ void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang32_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang32_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang32_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang32_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang32_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang32_18_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_all_angs_pred_4x4_sse2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_32x32_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); +void x265_all_angs_pred_4x4_avx2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); #endif // ifndef X265_INTRAPRED_H
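All of the declarations added in this header hunk share a single signature, and every SIMD variant implements the same two-tap HEVC angular interpolation that the assembly below performs with packed multiply-adds, a +16 rounding term, and a shift by 5 (visible as pd_16 and psrld 5). For reference, this is a minimal scalar sketch of that filter for the positive-angle vertical modes; the function name and the flat ref[] layout are illustrative, not taken from the x265 sources:

#include <stdint.h>
#include <stddef.h>

typedef uint8_t pixel;  /* 16bpp builds use uint16_t and clamp to pw_1023 */

/* dst[y][x] = (ref[i]*(32 - frac) + ref[i+1]*frac + 16) >> 5, where the
 * integer offset i and the 1/32-pel fraction both advance by the mode's
 * prediction angle on each output row. */
static void intra_pred_ang_ref(pixel* dst, intptr_t dstStride,
                               const pixel* ref, int width, int angle)
{
    for (int y = 0; y < width; y++)
    {
        int idx  = ((y + 1) * angle) >> 5;   /* integer reference offset */
        int frac = ((y + 1) * angle) & 31;   /* 1/32-pel fractional weight */
        for (int x = 0; x < width; x++)
        {
            int a = ref[x + idx + 1];
            int b = ref[x + idx + 2];
            dst[y * dstStride + x] = (pixel)((a * (32 - frac) + b * frac + 16) >> 5);
        }
    }
}

Negative-angle and horizontal modes additionally remap the reference samples and transpose the output, which is what the per-mode shuffle setup in the assembly below takes care of.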
x265_1.6.tar.gz/source/common/x86/intrapred16.asm -> x265_1.7.tar.gz/source/common/x86/intrapred16.asm
Changed
@@ -690,6 +690,508 @@ %endrep RET +;----------------------------------------------------------------------------------------- +; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter) +;----------------------------------------------------------------------------------------- +INIT_XMM sse2 +cglobal intra_pred_ang4_2, 3,5,4 + lea r4, [r2 + 4] + add r2, 20 + cmp r3m, byte 34 + cmove r2, r4 + + add r1, r1 + movu m0, [r2] + movh [r0], m0 + psrldq m0, 2 + movh [r0 + r1], m0 + psrldq m0, 2 + movh [r0 + r1 * 2], m0 + lea r1, [r1 * 3] + psrldq m0, 2 + movh [r0 + r1], m0 + RET + +cglobal intra_pred_ang4_3, 3,5,8 + mov r4d, 2 + cmp r3m, byte 33 + mov r3d, 18 + cmove r3d, r4d + + movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] + + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + mova m3, m0 + psrldq m0, 2 + punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2] + mova m4, m0 + psrldq m0, 2 + punpcklwd m4, m0 ; [7 6 6 5 5 4 4 3] + mova m5, m0 + psrldq m0, 2 + punpcklwd m5, m0 ; [8 7 7 6 6 5 5 4] + + + lea r3, [ang_table + 20 * 16] + mova m0, [r3 + 6 * 16] ; [26] + mova m1, [r3] ; [20] + mova m6, [r3 - 6 * 16] ; [14] + mova m7, [r3 - 12 * 16] ; [ 8] + jmp .do_filter4x4 + + +ALIGN 16 +.do_filter4x4: + lea r4, [pd_16] + pmaddwd m2, m0 + paddd m2, [r4] + psrld m2, 5 + + pmaddwd m3, m1 + paddd m3, [r4] + psrld m3, 5 + packssdw m2, m3 + + pmaddwd m4, m6 + paddd m4, [r4] + psrld m4, 5 + + pmaddwd m5, m7 + paddd m5, [r4] + psrld m5, 5 + packssdw m4, m5 + + jz .store + + ; transpose 4x4 + punpckhwd m0, m2, m4 + punpcklwd m2, m4 + punpckhwd m4, m2, m0 + punpcklwd m2, m0 + +.store: + add r1, r1 + movh [r0], m2 + movhps [r0 + r1], m2 + movh [r0 + r1 * 2], m4 + lea r1, [r1 * 3] + movhps [r0 + r1], m4 + RET + +cglobal intra_pred_ang4_4, 3,5,8 + mov r4d, 2 + cmp r3m, byte 32 + mov r3d, 18 + cmove r3d, r4d + + movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + mova m3, m0 + psrldq m0, 2 + punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2] + mova m4, m3 + mova m5, m0 + psrldq m0, 2 + punpcklwd m5, m0 ; [7 6 6 5 5 4 4 3] + + lea r3, [ang_table + 18 * 16] + mova m0, [r3 + 3 * 16] ; [21] + mova m1, [r3 - 8 * 16] ; [10] + mova m6, [r3 + 13 * 16] ; [31] + mova m7, [r3 + 2 * 16] ; [20] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_5, 3,5,8 + mov r4d, 2 + cmp r3m, byte 31 + mov r3d, 18 + cmove r3d, r4d + + movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + mova m3, m0 + psrldq m0, 2 + punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2] + mova m4, m3 + mova m5, m0 + psrldq m0, 2 + punpcklwd m5, m0 ; [7 6 6 5 5 4 4 3] + + lea r3, [ang_table + 10 * 16] + mova m0, [r3 + 7 * 16] ; [17] + mova m1, [r3 - 8 * 16] ; [ 2] + mova m6, [r3 + 9 * 16] ; [19] + mova m7, [r3 - 6 * 16] ; [ 4] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_6, 3,5,8 + mov r4d, 2 + cmp r3m, byte 30 + mov r3d, 18 + cmove r3d, r4d + + movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + mova m3, m2 + mova m4, m0 + psrldq m0, 2 + punpcklwd m4, m0 ; [6 5 5 4 4 3 3 2] + mova m5, m4 + + lea r3, [ang_table + 19 * 16] + mova m0, [r3 - 6 * 16] ; [13] + mova m1, [r3 + 7 * 16] ; [26] + mova m6, [r3 - 12 * 16] ; [ 7] + mova m7, [r3 + 1 * 16] ; [20] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_7, 3,5,8 + mov r4d, 2 + cmp 
r3m, byte 29 + mov r3d, 18 + cmove r3d, r4d + + movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + mova m3, m2 + mova m4, m2 + mova m5, m0 + psrldq m0, 2 + punpcklwd m5, m0 ; [6 5 5 4 4 3 3 2] + + lea r3, [ang_table + 20 * 16] + mova m0, [r3 - 11 * 16] ; [ 9] + mova m1, [r3 - 2 * 16] ; [18] + mova m6, [r3 + 7 * 16] ; [27] + mova m7, [r3 - 16 * 16] ; [ 4] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_8, 3,5,8 + mov r4d, 2 + cmp r3m, byte 28 + mov r3d, 18 + cmove r3d, r4d + + movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + mova m3, m2 + mova m4, m2 + mova m5, m2 + + lea r3, [ang_table + 13 * 16] + mova m0, [r3 - 8 * 16] ; [ 5] + mova m1, [r3 - 3 * 16] ; [10] + mova m6, [r3 + 2 * 16] ; [15] + mova m7, [r3 + 7 * 16] ; [20] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_9, 3,5,8 + mov r4d, 2 + cmp r3m, byte 27 + mov r3d, 18 + cmove r3d, r4d + + movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + mova m3, m2 + mova m4, m2 + mova m5, m2 + + lea r3, [ang_table + 4 * 16] + mova m0, [r3 - 2 * 16] ; [ 2] + mova m1, [r3 - 0 * 16] ; [ 4] + mova m6, [r3 + 2 * 16] ; [ 6] + mova m7, [r3 + 4 * 16] ; [ 8] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_10, 3,3,3 + movh m0, [r2 + 18] ; [4 3 2 1] + + punpcklwd m0, m0 ;[4 4 3 3 2 2 1 1] + pshufd m1, m0, 0xFA + add r1, r1 + pshufd m0, m0, 0x50 + movhps [r0 + r1], m0 + movh [r0 + r1 * 2], m1 + lea r1, [r1 * 3] + movhps [r0 + r1], m1 + + cmp r4m, byte 0 + jz .quit + + ; filter + movd m2, [r2] ; [7 6 5 4 3 2 1 0] + pshuflw m2, m2, 0x00 + movh m1, [r2 + 2] + psubw m1, m2 + psraw m1, 1 + paddw m0, m1 + pxor m1, m1 + pmaxsw m0, m1 + pminsw m0, [pw_1023] +.quit: + movh [r0], m0 + RET + +cglobal intra_pred_ang4_26, 3,3,3 + movh m0, [r2 + 2] ; [8 7 6 5 4 3 2 1] + add r1d, r1d + ; store + movh [r0], m0 + movh [r0 + r1], m0 + movh [r0 + r1 * 2], m0 + lea r3, [r1 * 3] + movh [r0 + r3], m0 + + ; filter + cmp r4m, byte 0 + jz .quit + + pshuflw m0, m0, 0x00 + movd m2, [r2] + pshuflw m2, m2, 0x00 + movh m1, [r2 + 18] + psubw m1, m2 + psraw m1, 1 + paddw m0, m1 + pxor m1, m1 + pmaxsw m0, m1 + pminsw m0, [pw_1023] + + movh r2, m0 + mov [r0], r2w + shr r2, 16 + mov [r0 + r1], r2w + shr r2, 16 + mov [r0 + r1 * 2], r2w + shr r2, 16 + mov [r0 + r3], r2w +.quit: + RET + +cglobal intra_pred_ang4_11, 3,5,8 + xor r4d, r4d + cmp r3m, byte 25 + mov r3d, 16 + cmove r3d, r4d + + movh m1, [r2 + r3 + 2] ; [x x x 4 3 2 1 0] + movh m2, [r2 - 6] + punpcklqdq m2, m1 + psrldq m2, 6 + punpcklwd m2, m1 ; [4 3 3 2 2 1 1 0] + mova m3, m2 + mova m4, m2 + mova m5, m2 + + lea r3, [ang_table + 24 * 16] + mova m0, [r3 + 6 * 16] ; [24] + mova m1, [r3 + 4 * 16] ; [26] + mova m6, [r3 + 2 * 16] ; [28] + mova m7, [r3 + 0 * 16] ; [30] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_12, 3,5,8 + xor r4d, r4d + cmp r3m, byte 24 + mov r3d, 16 + cmove r3d, r4d + + movh m1, [r2 + r3 + 2] + movh m2, [r2 - 6] + punpcklqdq m2, m1 + psrldq m2, 6 + punpcklwd m2, m1 ; [4 3 3 2 2 1 1 0] + mova m3, m2 + mova m4, m2 + mova m5, m2 + + lea r3, [ang_table + 20 * 16] + mova m0, [r3 + 7 * 16] ; [27] + mova m1, [r3 + 2 * 16] ; [22] + mova m6, [r3 - 3 * 16] ; [17] + mova m7, [r3 - 8 * 16] ; [12] + jmp 
mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_13, 3,5,8 + xor r4d, r4d + cmp r3m, byte 23 + mov r3d, 16 + jz .next + xchg r3d, r4d +.next: + movd m5, [r2 + r3 + 6] + movd m2, [r2 - 2] + movh m0, [r2 + r4 + 2] + punpcklwd m5, m2 + punpcklqdq m5, m0 + psrldq m5, 4 + mova m2, m5 + psrldq m2, 2 + punpcklwd m5, m2 ; [3 2 2 1 1 0 0 x] + punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0] + mova m3, m2 + mova m4, m2 + + lea r3, [ang_table + 21 * 16] + mova m0, [r3 + 2 * 16] ; [23] + mova m1, [r3 - 7 * 16] ; [14] + mova m6, [r3 - 16 * 16] ; [ 5] + mova m7, [r3 + 7 * 16] ; [28] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_14, 3,5,8 + xor r4d, r4d + cmp r3m, byte 22 + mov r3d, 16 + jz .next + xchg r3d, r4d +.next: + movd m5, [r2 + r3 + 2] + movd m2, [r2 - 2] + movh m0, [r2 + r4 + 2] + punpcklwd m5, m2 + punpcklqdq m5, m0 + psrldq m5, 4 + mova m2, m5 + psrldq m2, 2 + punpcklwd m5, m2 ; [3 2 2 1 1 0 0 x] + punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0] + mova m3, m2 + mova m4, m5 + + lea r3, [ang_table + 19 * 16] + mova m0, [r3 + 0 * 16] ; [19] + mova m1, [r3 - 13 * 16] ; [ 6] + mova m6, [r3 + 6 * 16] ; [25] + mova m7, [r3 - 7 * 16] ; [12] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_15, 3,5,8 + xor r4d, r4d + cmp r3m, byte 21 + mov r3d, 16 + jz .next + xchg r3d, r4d +.next: + movd m4, [r2] ;[x x x A] + movh m5, [r2 + r3 + 4] ;[x C x B] + movh m0, [r2 + r4 + 2] ;[4 3 2 1] + pshuflw m5, m5, 0x22 ;[B C B C] + punpcklqdq m5, m4 ;[x x x A B C B C] + psrldq m5, 2 ;[x x x x A B C B] + punpcklqdq m5, m0 + psrldq m5, 2 + mova m2, m5 + mova m3, m5 + psrldq m2, 4 + psrldq m3, 2 + punpcklwd m5, m3 ; [2 1 1 0 0 x x y] + punpcklwd m3, m2 ; [3 2 2 1 1 0 0 x] + punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0] + mova m4, m3 + + lea r3, [ang_table + 23 * 16] + mova m0, [r3 - 8 * 16] ; [15] + mova m1, [r3 + 7 * 16] ; [30] + mova m6, [r3 - 10 * 16] ; [13] + mova m7, [r3 + 5 * 16] ; [28] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_16, 3,5,8 + xor r4d, r4d + cmp r3m, byte 20 + mov r3d, 16 + jz .next + xchg r3d, r4d +.next: + movd m4, [r2] ;[x x x A] + movd m5, [r2 + r3 + 4] ;[x x C B] + movh m0, [r2 + r4 + 2] ;[4 3 2 1] + punpcklwd m5, m4 ;[x C A B] + pshuflw m5, m5, 0x4A ;[A B C C] + punpcklqdq m5, m0 ;[4 3 2 1 A B C C] + psrldq m5, 2 + mova m2, m5 + mova m3, m5 + psrldq m2, 4 + psrldq m3, 2 + punpcklwd m5, m3 ; [2 1 1 0 0 x x y] + punpcklwd m3, m2 ; [3 2 2 1 1 0 0 x] + punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0] + mova m4, m3 + + lea r3, [ang_table + 19 * 16] + mova m0, [r3 - 8 * 16] ; [11] + mova m1, [r3 + 3 * 16] ; [22] + mova m6, [r3 - 18 * 16] ; [ 1] + mova m7, [r3 - 7 * 16] ; [12] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_17, 3,5,8 + xor r4d, r4d + cmp r3m, byte 19 + mov r3d, 16 + jz .next + xchg r3d, r4d +.next: + movd m4, [r2] + movh m5, [r2 + r3 + 2] ;[D x C B] + pshuflw m5, m5, 0x1F ;[B C D D] + punpcklqdq m5, m4 ;[x x x A B C D D] + psrldq m5, 2 ;[x x x x A B C D] + movhps m5, [r2 + r4 + 2] + + mova m4, m5 + psrldq m4, 2 + punpcklwd m5, m4 + mova m3, m4 + psrldq m3, 2 + punpcklwd m4, m3 + mova m2, m3 + psrldq m2, 2 + punpcklwd m3, m2 + mova m1, m2 + psrldq m1, 2 + punpcklwd m2, m1 + + lea r3, [ang_table + 14 * 16] + mova m0, [r3 - 8 * 16] ; [ 6] + mova m1, [r3 - 2 * 16] ; [12] + mova m6, [r3 + 4 * 16] ; [18] + mova m7, [r3 + 10 * 16] ; [24] + 
jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_18, 3,3,1 + movh m0, [r2 + 16] + pinsrw m0, [r2], 0 + pshuflw m0, m0, q0123 + movhps m0, [r2 + 2] + add r1, r1 + lea r2, [r1 * 3] + movh [r0 + r2], m0 + psrldq m0, 2 + movh [r0 + r1 * 2], m0 + psrldq m0, 2 + movh [r0 + r1], m0 + psrldq m0, 2 + movh [r0], m0 + RET + ;----------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter) ;-----------------------------------------------------------------------------------
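A structural note on the SSE2 hunk above: only intra_pred_ang4_3 carries the full filter; the other fractional-angle modes build their shuffled source rows, load four weight vectors from ang_table, and tail-jump into the shared .do_filter4x4 label. That label reuses the Z flag still set by the caller's cmp against the paired vertical mode to choose between a direct store (vertical modes) and a 4x4 transpose (horizontal modes), since modes k and 36 - k apply identical weights to transposed blocks. The pure horizontal/vertical modes 10 and 26 instead copy the reference directly and, when bFilter is set, apply an edge gradient clamped with pmaxsw/pminsw against pw_1023 in the 10-bit path. A hedged C sketch of the mode pairing, with illustrative names and the two-tap filter inlined (a reconstruction for illustration, not code from the repository):

#include <stdint.h>
#include <stddef.h>

typedef uint8_t pixel;

/* One kernel serves the mode pair (k, 36 - k): compute the 4x4 block for
 * the shared weights, then store it transposed for the horizontal mode. */
static void predict_ang4_pair(pixel* dst, intptr_t stride,
                              const pixel* ref, int mode, int angle)
{
    pixel tmp[4][4];
    for (int y = 0; y < 4; y++)
    {
        int idx  = ((y + 1) * angle) >> 5;
        int frac = ((y + 1) * angle) & 31;
        for (int x = 0; x < 4; x++)
            tmp[y][x] = (pixel)((ref[x + idx + 1] * (32 - frac)
                               + ref[x + idx + 2] * frac + 16) >> 5);
    }
    int vertical = (mode >= 18);  /* modes 18..34 predict from the above row */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride + x] = vertical ? tmp[y][x] : tmp[x][y];
}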
x265_1.6.tar.gz/source/common/x86/intrapred8.asm -> x265_1.7.tar.gz/source/common/x86/intrapred8.asm
Changed
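The intrapred8.asm hunk below consists mostly of new constant tables (c_ang16_mode_*, c_ang32_mode_*, c_ang4_mode_*, plus the intra_pred4_shuff* source shuffles). Each byte pair in the angle tables sums to 32: the pairs are the (32 - frac, frac) weights consumed by pmaddubsw on adjacent (ref[i], ref[i+1]) pixel pairs, and each 32-byte db line of the c_ang16 tables packs two output rows, row k in the low AVX2 lane and row k+8 in the high lane. A small generator sketch, reconstructed from the table contents rather than taken from the repository, reproduces one mode's weights:

#include <stdio.h>

int main(void)
{
    /* HEVC intra mode 12 has prediction angle -5; row 1 then yields
     * frac = (-5) & 31 = 27, i.e. the pair "5, 27" that opens
     * c_ang16_mode_12 in the hunk below. */
    int angle = -5;
    for (int row = 1; row <= 16; row++)
    {
        int frac = (row * angle) & 31;  /* 1/32-pel fractional position */
        printf("row %2d: db %2d, %2d\n", row, 32 - frac, frac);
    }
    return 0;
}

Running it prints the (5, 27), (10, 22), (15, 17), ... sequence that leads each db line of c_ang16_mode_12, with rows 9..16 appearing as the second pair on each line.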
@@ -28,6 +28,7 @@ SECTION_RODATA 32 intra_pred_shuff_0_8: times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 +intra_pred_shuff_15_0: times 2 db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 pb_0_8 times 8 db 0, 8 pb_unpackbw1 times 2 db 1, 8, 2, 8, 3, 8, 4, 8 @@ -58,7 +59,6 @@ c_mode16_18: db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 ALIGN 32 -trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7 c_ang8_src1_9_2_10: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9 c_ang8_26_20: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 c_ang8_src3_11_4_12: db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11 @@ -124,6 +124,37 @@ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 +ALIGN 32 +c_ang16_mode_11: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + + +ALIGN 32 +c_ang16_mode_12: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9 + db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + + +ALIGN 32 +c_ang16_mode_13: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 + db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 
12, 20, 12, 20, 12, 20 + db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 ALIGN 32 c_ang16_mode_28: db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 @@ -135,6 +166,15 @@ db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 +ALIGN 32 +c_ang16_mode_9: db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 ALIGN 32 c_ang16_mode_27: db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 @@ -150,6 +190,15 @@ ALIGN 32 intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15 +ALIGN 32 +c_ang16_mode_8: db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 ALIGN 32 c_ang16_mode_29: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 @@ -162,6 +211,15 @@ db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 db 25, 7, 
25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 +ALIGN 32 +c_ang16_mode_7: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 ALIGN 32 c_ang16_mode_30: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 @@ -175,6 +233,17 @@ db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + +ALIGN 32 +c_ang16_mode_6: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + ALIGN 32 c_ang16_mode_31: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 @@ -186,6 +255,17 @@ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + +ALIGN 32 +c_ang16_mode_5: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 
11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + ALIGN 32 c_ang16_mode_32: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 @@ -200,6 +280,16 @@ db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 ALIGN 32 +c_ang16_mode_4: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + +ALIGN 32 c_ang16_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 @@ -216,6 +306,16 @@ db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 ALIGN 32 +c_ang16_mode_3: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + +ALIGN 32 c_ang16_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 
22, 10, 22, 10, 22 db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 @@ -376,6 +476,191 @@ db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + +ALIGN 32 +c_ang32_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 
20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + + + +ALIGN 32 +c_ang32_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + + + +ALIGN 32 +c_ang32_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 
31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 + db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5 + db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + + +ALIGN 32 +c_ang32_mode_23: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5 + db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 + db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + + +ALIGN 32 +c_ang32_mode_22: db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 27, 5, 27, 
5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9 + db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 + db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + +ALIGN 32 +c_ang32_mode_21: db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 + db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + 
db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + + +ALIGN 32 +intra_pred_shuff_0_4: times 4 db 0, 1, 1, 2, 2, 3, 3, 4 +intra_pred4_shuff1: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5 +intra_pred4_shuff2: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5 +intra_pred4_shuff31: db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6 +intra_pred4_shuff33: db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7 +intra_pred4_shuff3: db 8, 9, 9, 10, 10, 11, 11, 12, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15 +intra_pred4_shuff4: db 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15 +intra_pred4_shuff5: db 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15 +intra_pred4_shuff6: db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14 +intra_pred4_shuff7: db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14 +intra_pred4_shuff9: db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13 +intra_pred4_shuff12: db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12,0, 9, 9, 10, 10, 11, 11, 12 +intra_pred4_shuff13: db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11 +intra_pred4_shuff14: db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11 +intra_pred4_shuff15: db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10 +intra_pred4_shuff16: db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10 +intra_pred4_shuff17: db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9 +intra_pred4_shuff19: db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1 +intra_pred4_shuff20: db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2 +intra_pred4_shuff21: db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2 +intra_pred4_shuff22: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3 +intra_pred4_shuff23: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3 + +c_ang4_mode_27: db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8 +c_ang4_mode_28: db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 
22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang4_mode_29: db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4 +c_ang4_mode_30: db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang4_mode_31: db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4 +c_ang4_mode_32: db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang4_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8 +c_ang4_mode_5: db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4 +c_ang4_mode_6: db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang4_mode_7: db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4 +c_ang4_mode_8: db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang4_mode_9: db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8 +c_ang4_mode_11: db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24 +c_ang4_mode_12: db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12 +c_ang4_mode_13: db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28 +c_ang4_mode_14: db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12 +c_ang4_mode_15: db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28, 4 +c_ang4_mode_16: db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12 +c_ang4_mode_17: db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24 +c_ang4_mode_19: db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24 +c_ang4_mode_20: db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12 +c_ang4_mode_21: db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28 +c_ang4_mode_22: db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12 +c_ang4_mode_23: db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28 +c_ang4_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12 +c_ang4_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24 + ALIGN 32 ;; (blkSize - 1 - x) pw_planar4_0: dw 3, 2, 1, 0, 3, 2, 1, 0 @@ -388,6 +673,29 @@ pw_planar32_L: dw 31, 30, 29, 28, 27, 26, 25, 24 pw_planar32_H: dw 23, 
22, 21, 20, 19, 18, 17, 16 +ALIGN 32 +c_ang8_mode_13: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + +ALIGN 32 +c_ang8_mode_14: db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + +ALIGN 32 +c_ang8_mode_15: db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + +ALIGN 32 +c_ang8_mode_20: db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 const ang_table %assign x 0 @@ -409,8 +717,11 @@ cextern pw_4 cextern pw_8 cextern pw_16 +cextern pw_15 +cextern pw_31 cextern pw_32 cextern pw_257 +cextern pw_512 cextern pw_1024 cextern pw_4096 cextern pw_00ff @@ -420,6 +731,9 @@ cextern multiH2 cextern multiH3 cextern multi_2Row +cextern trans8_shuf +cextern pw_planar16_mul +cextern pw_planar32_mul ;--------------------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter) @@ -1249,7 +1563,7 @@ ; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter) ;----------------------------------------------------------------------------------------- INIT_XMM sse2 -cglobal intra_pred_ang4_2, 3,5,3 +cglobal intra_pred_ang4_2, 3,5,1 lea r4, [r2 + 2] add r2, 10 cmp r3m, byte 34 @@ -1257,23 +1571,21 @@ movh m0, [r2] movd [r0], m0 - mova m1, m0 - psrldq m1, 1 - movd [r0 + r1], m1 - mova m2, m0 - psrldq m2, 2 - movd [r0 + r1 * 2], m2 + psrldq m0, 1 + movd [r0 + r1], m0 + psrldq m0, 1 + movd [r0 + r1 * 2], m0 lea r1, [r1 * 3] - psrldq m0, 3 + psrldq m0, 1 movd [r0 + r1], m0 RET INIT_XMM sse2 cglobal intra_pred_ang4_3, 3,5,8 - mov r4, 1 + mov r4d, 1 cmp r3m, byte 33 - mov r3, 9 - cmove r3, r4 + mov r3d, 9 + cmove r3d, r4d movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] mova m1, m0 @@ -1299,7 +1611,6 @@ ALIGN 16 .do_filter4x4: pxor m1, m1 - pxor m3, m3 punpckhbw m3, m0 
psrlw m3, 8 pmaddwd m3, m5 @@ -1308,7 +1619,6 @@ packssdw m0, m3 paddw m0, [pw_16] psraw m0, 5 - pxor m3, m3 punpckhbw m3, m2 psrlw m3, 8 pmaddwd m3, m7 @@ -1335,32 +1645,31 @@ .store: packuswb m0, m2 movd [r0], m0 - pshufd m0, m0, 0x39 + psrldq m0, 4 movd [r0 + r1], m0 - pshufd m0, m0, 0x39 + psrldq m0, 4 movd [r0 + r1 * 2], m0 lea r1, [r1 * 3] - pshufd m0, m0, 0x39 + psrldq m0, 4 movd [r0 + r1], m0 RET cglobal intra_pred_ang4_4, 3,5,8 - xor r4, r4 - inc r4 + xor r4d, r4d + inc r4d cmp r3m, byte 32 - mov r3, 9 - cmove r3, r4 + mov r3d, 9 + cmove r3d, r4d movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] + punpcklbw m0, m0 + psrldq m0, 1 + mova m2, m0 + psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] mova m1, m0 - psrldq m1, 1 ; [x 8 7 6 5 4 3 2] - punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] - mova m1, m0 - psrldq m1, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] - mova m3, m0 - psrldq m3, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3] - punpcklqdq m0, m1 - punpcklqdq m2, m1, m3 + psrldq m1, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3] + punpcklqdq m0, m2 + punpcklqdq m2, m1 lea r3, [pw_ang_table + 18 * 16] mova m4, [r3 + 3 * 16] ; [21] @@ -1370,22 +1679,21 @@ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) cglobal intra_pred_ang4_5, 3,5,8 - xor r4, r4 - inc r4 + xor r4d, r4d + inc r4d cmp r3m, byte 31 - mov r3, 9 - cmove r3, r4 + mov r3d, 9 + cmove r3d, r4d movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m1, m0 - psrldq m1, 1 ; [x 8 7 6 5 4 3 2] - punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] - mova m1, m0 - psrldq m1, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] + punpcklbw m0, m0 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + psrldq m0, 1 + mova m2, m0 + psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] mova m3, m0 psrldq m3, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3] - punpcklqdq m0, m1 - punpcklqdq m2, m1, m3 + punpcklqdq m0, m2 + punpcklqdq m2, m3 lea r3, [pw_ang_table + 10 * 16] mova m4, [r3 + 7 * 16] ; [17] @@ -1395,18 +1703,17 @@ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) cglobal intra_pred_ang4_6, 3,5,8 - xor r4, r4 - inc r4 + xor r4d, r4d + inc r4d cmp r3m, byte 30 - mov r3, 9 - cmove r3, r4 + mov r3d, 9 + cmove r3d, r4d movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m1, m0 - psrldq m1, 1 ; [x 8 7 6 5 4 3 2] - punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m0, m0 + psrldq m0, 1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] mova m2, m0 - psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] + psrldq m2, 2 ; [x x x 8 8 7 7 6 6 5 5 4 4 3 3 2] punpcklqdq m0, m0 punpcklqdq m2, m2 @@ -1418,20 +1725,20 @@ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) cglobal intra_pred_ang4_7, 3,5,8 - xor r4, r4 - inc r4 + xor r4d, r4d + inc r4d cmp r3m, byte 29 - mov r3, 9 - cmove r3, r4 + mov r3d, 9 + cmove r3d, r4d movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m1, m0 - psrldq m1, 1 ; [x 8 7 6 5 4 3 2] - punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] - mova m3, m0 - psrldq m3, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] - punpcklqdq m2, m0, m3 + punpcklbw m0, m0 + psrldq m0, 1 + mova m2, m0 + psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] punpcklqdq m0, m0 + punpcklqdq m2, m2 + movhlps m2, m0 lea r3, [pw_ang_table + 20 * 16] mova m4, [r3 - 11 * 16] ; [ 9] @@ -1441,16 +1748,15 @@ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) cglobal intra_pred_ang4_8, 3,5,8 - xor r4, r4 - inc r4 + xor r4d, r4d + inc r4d cmp r3m, byte 28 - mov r3, 9 - cmove r3, r4 + mov r3d, 9 + cmove r3d, r4d movh m0, [r2 + r3] ; [8 7 6 
5 4 3 2 1] - mova m1, m0 - psrldq m1, 1 ; [x 8 7 6 5 4 3 2] - punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m0, m0 + psrldq m0, 1 punpcklqdq m0, m0 mova m2, m0 @@ -1462,16 +1768,15 @@ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) cglobal intra_pred_ang4_9, 3,5,8 - xor r4, r4 - inc r4 + xor r4d, r4d + inc r4d cmp r3m, byte 27 - mov r3, 9 - cmove r3, r4 + mov r3d, 9 + cmove r3d, r4d movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m1, m0 - psrldq m1, 1 ; [x 8 7 6 5 4 3 2] - punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m0, m0 + psrldq m0, 1 ; [x 8 7 6 5 4 3 2] punpcklqdq m0, m0 mova m2, m0 @@ -1482,6 +1787,292 @@ mova m7, [r3 + 4 * 16] ; [ 8] jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) +cglobal intra_pred_ang4_10, 3,5,4 + movd m0, [r2 + 9] ; [8 7 6 5 4 3 2 1] + punpcklbw m0, m0 + punpcklwd m0, m0 + pshufd m1, m0, 1 + movhlps m2, m0 + pshufd m3, m0, 3 + movd [r0 + r1], m1 + movd [r0 + r1 * 2], m2 + lea r1, [r1 * 3] + movd [r0 + r1], m3 + cmp r4m, byte 0 + jz .quit + + ; filter + pxor m3, m3 + punpcklbw m0, m3 + movh m1, [r2] ; [4 3 2 1 0] + punpcklbw m1, m3 + pshuflw m2, m1, 0x00 + psrldq m1, 2 + psubw m1, m2 + psraw m1, 1 + paddw m0, m1 + packuswb m0, m0 + +.quit: + movd [r0], m0 + RET + +cglobal intra_pred_ang4_26, 3,4,4 + movd m0, [r2 + 1] ; [8 7 6 5 4 3 2 1] + + ; store + movd [r0], m0 + movd [r0 + r1], m0 + movd [r0 + r1 * 2], m0 + lea r3, [r1 * 3] + movd [r0 + r3], m0 + + ; filter + cmp r4m, byte 0 + jz .quit + + pxor m3, m3 + punpcklbw m0, m3 + pshuflw m0, m0, 0x00 + movd m2, [r2] + punpcklbw m2, m3 + pshuflw m2, m2, 0x00 + movd m1, [r2 + 9] + punpcklbw m1, m3 + psubw m1, m2 + psraw m1, 1 + paddw m0, m1 + packuswb m0, m0 + + movd r2, m0 + mov [r0], r2b + shr r2, 8 + mov [r0 + r1], r2b + shr r2, 8 + mov [r0 + r1 * 2], r2b + shr r2, 8 + mov [r0 + r3], r2b + +.quit: + RET + +cglobal intra_pred_ang4_11, 3,5,8 + xor r4d, r4d + cmp r3m, byte 25 + mov r3d, 8 + cmove r3d, r4d + + movd m1, [r2 + r3 + 1] ;[4 3 2 1] + movh m0, [r2 - 7] ;[A x x x x x x x] + punpcklbw m1, m1 ;[4 4 3 3 2 2 1 1] + punpcklqdq m0, m1 ;[4 4 3 3 2 2 1 1 A x x x x x x x]] + psrldq m0, 7 ;[x x x x x x x x 4 3 3 2 2 1 1 A] + punpcklqdq m0, m0 + mova m2, m0 + + lea r3, [pw_ang_table + 24 * 16] + + mova m4, [r3 + 6 * 16] ; [24] + mova m5, [r3 + 4 * 16] ; [26] + mova m6, [r3 + 2 * 16] ; [28] + mova m7, [r3 + 0 * 16] ; [30] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_12, 3,5,8 + xor r4d, r4d + cmp r3m, byte 24 + mov r3d, 8 + cmove r3d, r4d + + movd m1, [r2 + r3 + 1] + movh m0, [r2 - 7] + punpcklbw m1, m1 + punpcklqdq m0, m1 + psrldq m0, 7 + punpcklqdq m0, m0 + mova m2, m0 + + lea r3, [pw_ang_table + 20 * 16] + mova m4, [r3 + 7 * 16] ; [27] + mova m5, [r3 + 2 * 16] ; [22] + mova m6, [r3 - 3 * 16] ; [17] + mova m7, [r3 - 8 * 16] ; [12] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_13, 3,5,8 + xor r4d, r4d + cmp r3m, byte 23 + mov r3d, 8 + jz .next + xchg r3d, r4d + +.next: + movd m1, [r2 - 1] ;[x x A x] + movd m2, [r2 + r4 + 1] ;[4 3 2 1] + movd m0, [r2 + r3 + 3] ;[x x B x] + punpcklbw m0, m1 ;[x x x x A B x x] + punpckldq m0, m2 ;[4 3 2 1 A B x x] + psrldq m0, 2 ;[x x 4 3 2 1 A B] + punpcklbw m0, m0 ;[x x x x 4 4 3 3 2 2 1 1 A A B B] + mova m1, m0 + psrldq m0, 3 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] + psrldq m1, 1 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + movh m2, m0 + punpcklqdq m0, m0 + punpcklqdq m2, m1 + 
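+; reviewer note, not upstream code: the coefficient loads below are the
+; HEVC angular interpolation fractions for mode 13 (intraPredAngle = -9).
+; For row y the fraction is ((y + 1) * angle) & 31, i.e. (-9 & 31) = 23,
+; (-18 & 31) = 14, (-27 & 31) = 5 and (-36 & 31) = 28, which is exactly the
+; [23]/[14]/[5]/[28] sequence annotated on the mova lines; each pw_ang_table
+; entry presumably packs the matching (32 - fract, fract) word pair for the
+; pmaddwd in the shared .do_filter4x4 tail.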
+ lea r3, [pw_ang_table + 21 * 16] + mova m4, [r3 + 2 * 16] ; [23] + mova m5, [r3 - 7 * 16] ; [14] + mova m6, [r3 - 16 * 16] ; [ 5] + mova m7, [r3 + 7 * 16] ; [28] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_14, 3,5,8 + xor r4d, r4d + cmp r3m, byte 22 + mov r3d, 8 + jz .next + xchg r3d, r4d + +.next: + movd m1, [r2 - 1] ;[x x A x] + movd m0, [r2 + r3 + 1] ;[x x B x] + punpcklbw m0, m1 ;[A B x x] + movd m1, [r2 + r4 + 1] ;[4 3 2 1] + punpckldq m0, m1 ;[4 3 2 1 A B x x] + psrldq m0, 2 ;[x x 4 3 2 1 A B] + punpcklbw m0, m0 ;[x x x x 4 4 3 3 2 2 1 1 A A B B] + mova m2, m0 + psrldq m0, 3 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] + psrldq m2, 1 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + punpcklqdq m0, m0 + punpcklqdq m2, m2 + + lea r3, [pw_ang_table + 19 * 16] + mova m4, [r3 + 0 * 16] ; [19] + mova m5, [r3 - 13 * 16] ; [ 6] + mova m6, [r3 + 6 * 16] ; [25] + mova m7, [r3 - 7 * 16] ; [12] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_15, 3,5,8 + xor r4d, r4d + cmp r3m, byte 21 + mov r3d, 8 + jz .next + xchg r3d, r4d + +.next: + movd m0, [r2] ;[x x x A] + movd m1, [r2 + r3 + 2] ;[x x x B] + punpcklbw m1, m0 ;[x x A B] + movd m0, [r2 + r3 + 3] ;[x x C x] + punpcklwd m0, m1 ;[A B C x] + movd m1, [r2 + r4 + 1] ;[4 3 2 1] + punpckldq m0, m1 ;[4 3 2 1 A B C x] + psrldq m0, 1 ;[x 4 3 2 1 A B C] + punpcklbw m0, m0 ;[x x 4 4 3 3 2 2 1 1 A A B B C C] + psrldq m0, 1 + movh m1, m0 + psrldq m0, 2 + movh m2, m0 + psrldq m0, 2 + punpcklqdq m0, m2 + punpcklqdq m2, m1 + + lea r3, [pw_ang_table + 23 * 16] + mova m4, [r3 - 8 * 16] ; [15] + mova m5, [r3 + 7 * 16] ; [30] + mova m6, [r3 - 10 * 16] ; [13] + mova m7, [r3 + 5 * 16] ; [28] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_16, 3,5,8 + xor r4d, r4d + cmp r3m, byte 20 + mov r3d, 8 + jz .next + xchg r3d, r4d + +.next: + movd m2, [r2] ;[x x x A] + movd m1, [r2 + r3 + 2] ;[x x x B] + punpcklbw m1, m2 ;[x x A B] + movh m0, [r2 + r3 + 2] ;[x x C x] + punpcklwd m0, m1 ;[A B C x] + movd m1, [r2 + r4 + 1] ;[4 3 2 1] + punpckldq m0, m1 ;[4 3 2 1 A B C x] + psrldq m0, 1 ;[x 4 3 2 1 A B C] + punpcklbw m0, m0 ;[x x 4 4 3 3 2 2 1 1 A A B B C C] + psrldq m0, 1 + movh m1, m0 + psrldq m0, 2 + movh m2, m0 + psrldq m0, 2 + punpcklqdq m0, m2 + punpcklqdq m2, m1 + + lea r3, [pw_ang_table + 19 * 16] + mova m4, [r3 - 8 * 16] ; [11] + mova m5, [r3 + 3 * 16] ; [22] + mova m6, [r3 - 18 * 16] ; [ 1] + mova m7, [r3 - 7 * 16] ; [12] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_17, 3,5,8 + xor r4d, r4d + cmp r3m, byte 19 + mov r3d, 8 + jz .next + xchg r3d, r4d + +.next: + movd m2, [r2] ;[x x x A] + movd m3, [r2 + r3 + 1] ;[x x x B] + movd m4, [r2 + r3 + 2] ;[x x x C] + movd m0, [r2 + r3 + 4] ;[x x x D] + punpcklbw m3, m2 ;[x x A B] + punpcklbw m0, m4 ;[x x C D] + punpcklwd m0, m3 ;[A B C D] + movd m1, [r2 + r4 + 1] ;[4 3 2 1] + punpckldq m0, m1 ;[4 3 2 1 A B C D] + punpcklbw m0, m0 ;[4 4 3 3 2 2 1 1 A A B B C C D D] + psrldq m0, 1 + movh m1, m0 + psrldq m0, 2 + movh m2, m0 + punpcklqdq m2, m1 + psrldq m0, 2 + movh m1, m0 + psrldq m0, 2 + punpcklqdq m0, m1 + + lea r3, [pw_ang_table + 14 * 16] + mova m4, [r3 - 8 * 16] ; [ 6] + mova m5, [r3 - 2 * 16] ; [12] + mova m6, [r3 + 4 * 16] ; [18] + mova m7, [r3 + 10 * 16] ; [24] + jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + +cglobal intra_pred_ang4_18, 3,4,2 + mov r3d, [r2 + 8] 
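+; reviewer note, not upstream code: assuming the srcPix layout used by the
+; other ang4 kernels here (corner at [r2], above from [r2 + 1], left from
+; [r2 + 9]), the load above plus the next two ops build the dword
+; {corner, left[0], left[1], left[2]}; bswap reverses it so that, once
+; above[0..3] is appended with punpckldq, each psrldq by one byte peels off
+; the next 45-degree diagonal row (the bottom row is stored first).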
+ mov r3b, byte [r2] + bswap r3d + movd m0, r3d + + movd m1, [r2 + 1] + punpckldq m0, m1 + lea r3, [r1 * 3] + movd [r0 + r3], m0 + psrldq m0, 1 + movd [r0 + r1 * 2], m0 + psrldq m0, 1 + movd [r0 + r1], m0 + psrldq m0, 1 + movd [r0], m0 + RET + ;--------------------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter) ;--------------------------------------------------------------------------------------------- @@ -1809,6 +2400,69 @@ RET +;--------------------------------------------------------------------------------------------- +; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter) +;--------------------------------------------------------------------------------------------- +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal intra_pred_dc32, 3, 4, 3 + lea r3, [r1 * 3] + pxor m0, m0 + movu m1, [r2 + 1] + movu m2, [r2 + 65] + psadbw m1, m0 + psadbw m2, m0 + paddw m1, m2 + vextracti128 xm2, m1, 1 + paddw m1, m2 + pshufd m2, m1, 2 + paddw m1, m2 + + pmulhrsw m1, [pw_512] ; sum = (sum + 32) / 64 + vpbroadcastb m1, xm1 ; m1 = byte [dc_val ...] + + movu [r0 + r1 * 0], m1 + movu [r0 + r1 * 1], m1 + movu [r0 + r1 * 2], m1 + movu [r0 + r3 * 1], m1 + lea r0, [r0 + 4 * r1] + movu [r0 + r1 * 0], m1 + movu [r0 + r1 * 1], m1 + movu [r0 + r1 * 2], m1 + movu [r0 + r3 * 1], m1 + lea r0, [r0 + 4 * r1] + movu [r0 + r1 * 0], m1 + movu [r0 + r1 * 1], m1 + movu [r0 + r1 * 2], m1 + movu [r0 + r3 * 1], m1 + lea r0, [r0 + 4 * r1] + movu [r0 + r1 * 0], m1 + movu [r0 + r1 * 1], m1 + movu [r0 + r1 * 2], m1 + movu [r0 + r3 * 1], m1 + lea r0, [r0 + 4 * r1] + movu [r0 + r1 * 0], m1 + movu [r0 + r1 * 1], m1 + movu [r0 + r1 * 2], m1 + movu [r0 + r3 * 1], m1 + lea r0, [r0 + 4 * r1] + movu [r0 + r1 * 0], m1 + movu [r0 + r1 * 1], m1 + movu [r0 + r1 * 2], m1 + movu [r0 + r3 * 1], m1 + lea r0, [r0 + 4 * r1] + movu [r0 + r1 * 0], m1 + movu [r0 + r1 * 1], m1 + movu [r0 + r1 * 2], m1 + movu [r0 + r3 * 1], m1 + lea r0, [r0 + 4 * r1] + movu [r0 + r1 * 0], m1 + movu [r0 + r1 * 1], m1 + movu [r0 + r1 * 2], m1 + movu [r0 + r3 * 1], m1 + RET +%endif ;; ARCH_X86_64 == 1 + ;--------------------------------------------------------------------------------------- ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- @@ -2000,6 +2654,57 @@ ;--------------------------------------------------------------------------------------- ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal intra_pred_planar16, 3,3,6 + vpbroadcastw m3, [r2 + 17] + mova m5, [pw_00ff] + vpbroadcastw m4, [r2 + 49] + mova m0, [pw_planar16_mul] + pmovzxbw m2, [r2 + 1] + pand m3, m5 ; v_topRight + pand m4, m5 ; v_bottomLeft + + pmullw m3, [multiL] ; (x + 1) * topRight + pmullw m1, m2, [pw_15] ; (blkSize - 1 - y) * above[x] + paddw m3, [pw_16] + paddw m3, m4 + paddw m3, m1 + psubw m4, m2 + add r2, 33 + +%macro INTRA_PRED_PLANAR16_AVX2 1 + vpbroadcastw m1, [r2 + %1] + vpsrlw m2, m1, 8 + pand m1, m5 + + pmullw m1, m0 + pmullw m2, m0 + paddw m1, m3 + paddw m3, m4 + psraw m1, 5 + paddw m2, m3 + psraw m2, 5 + paddw m3, m4 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0], xm1 + vextracti128 [r0 + r1], m1, 1 + lea r0, [r0 + r1 * 2] +%endmacro + 
INTRA_PRED_PLANAR16_AVX2 0 + INTRA_PRED_PLANAR16_AVX2 2 + INTRA_PRED_PLANAR16_AVX2 4 + INTRA_PRED_PLANAR16_AVX2 6 + INTRA_PRED_PLANAR16_AVX2 8 + INTRA_PRED_PLANAR16_AVX2 10 + INTRA_PRED_PLANAR16_AVX2 12 + INTRA_PRED_PLANAR16_AVX2 14 +%undef INTRA_PRED_PLANAR16_AVX2 + RET + +;--------------------------------------------------------------------------------------- +; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) +;--------------------------------------------------------------------------------------- INIT_XMM sse4 %if ARCH_X86_64 == 1 cglobal intra_pred_planar32, 3,4,12 @@ -2104,6 +2809,91 @@ jnz .loop RET +;--------------------------------------------------------------------------------------- +; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) +;--------------------------------------------------------------------------------------- +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal intra_pred_planar32, 3,4,11 + mova m6, [pw_00ff] + vpbroadcastw m3, [r2 + 33] ; topRight = above[32] + vpbroadcastw m2, [r2 + 97] ; bottomLeft = left[32] + pand m3, m6 + pand m2, m6 + + pmullw m0, m3, [multiL] ; (x + 1) * topRight + pmullw m3, [multiH2] ; (x + 1) * topRight + + paddw m0, m2 + paddw m3, m2 + paddw m0, [pw_32] + paddw m3, [pw_32] + + pmovzxbw m4, [r2 + 1] + pmovzxbw m1, [r2 + 17] + pmullw m5, m4, [pw_31] + paddw m0, m5 + psubw m5, m2, m4 + psubw m2, m1 + pmullw m1, [pw_31] + paddw m3, m1 + mova m1, m5 + + add r2, 65 ; (2 * blkSize + 1) + mova m9, [pw_planar32_mul] + mova m10, [pw_planar16_mul] + +%macro INTRA_PRED_PLANAR32_AVX2 0 + vpbroadcastw m4, [r2] + vpsrlw m7, m4, 8 + pand m4, m6 + + pmullw m5, m4, m9 + pmullw m4, m4, m10 + paddw m5, m0 + paddw m4, m3 + paddw m0, m1 + paddw m3, m2 + psraw m5, 6 + psraw m4, 6 + packuswb m5, m4 + pmullw m8, m7, m9 + pmullw m7, m7, m10 + vpermq m5, m5, 11011000b + paddw m8, m0 + paddw m7, m3 + paddw m0, m1 + paddw m3, m2 + psraw m8, 6 + psraw m7, 6 + packuswb m8, m7 + add r2, 2 + vpermq m8, m8, 11011000b + + movu [r0], m5 + movu [r0 + r1], m8 + lea r0, [r0 + r1 * 2] +%endmacro + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 + INTRA_PRED_PLANAR32_AVX2 +%undef INTRA_PRED_PLANAR32_AVX2 + RET +%endif ;; ARCH_X86_64 == 1 + ;----------------------------------------------------------------------------------------- ; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter) ;----------------------------------------------------------------------------------------- @@ -9577,6 +10367,99 @@ RET +INIT_YMM avx2 +cglobal intra_pred_ang32_18, 4, 4, 3 + movu m0, [r2] + movu xm1, [r2 + 1 + 64] + pshufb xm1, [intra_pred_shuff_15_0] + mova xm2, xm0 + vinserti128 m1, m1, xm2, 1 + + lea r3, [r1 * 3] + + movu [r0], m0 + palignr m2, m0, m1, 15 + movu [r0 + r1], m2 + palignr m2, m0, m1, 14 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 13 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m1, 12 + movu [r0], m2 + palignr m2, m0, m1, 11 + movu [r0 + r1], m2 + palignr m2, m0, m1, 10 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 9 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m1, 8 + movu [r0], m2 + 
palignr m2, m0, m1, 7 + movu [r0 + r1], m2 + palignr m2, m0, m1, 6 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 5 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m1, 4 + movu [r0], m2 + palignr m2, m0, m1, 3 + movu [r0 + r1], m2 + palignr m2, m0, m1, 2 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 1 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + movu [r0], m1 + + movu xm0, [r2 + 64 + 17] + pshufb xm0, [intra_pred_shuff_15_0] + vinserti128 m0, m0, xm1, 1 + + palignr m2, m1, m0, 15 + movu [r0 + r1], m2 + palignr m2, m1, m0, 14 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 13 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 12 + movu [r0], m2 + palignr m2, m1, m0, 11 + movu [r0 + r1], m2 + palignr m2, m1, m0, 10 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 9 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 8 + movu [r0], m2 + palignr m2, m1, m0, 7 + movu [r0 + r1], m2 + palignr m2, m1, m0,6 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 5 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 4 + movu [r0], m2 + palignr m2, m1, m0, 3 + movu [r0 + r1], m2 + palignr m2, m1, m0,2 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 1 + movu [r0 + r3], m2 + RET + INIT_XMM sse4 cglobal intra_pred_ang32_18, 4,5,5 movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] @@ -11099,6 +11982,441 @@ movhps [r0 + r3], xm2 RET +INIT_YMM avx2 +cglobal intra_pred_ang8_15, 3, 6, 6 + mova m3, [pw_1024] + movu xm5, [r2 + 16] + pinsrb xm5, [r2], 0 + lea r5, [intra_pred_shuff_0_8] + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 2], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + + lea r4, [c_ang8_mode_15] + pmaddubsw m1, m0, [r4] + pmulhrsw m1, m3 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 4], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m2, m0, [r4 + mmsize] + pmulhrsw m2, m3 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 6], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m4, m0, [r4 + 2 * mmsize] + pmulhrsw m4, m3 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 8], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m0, [r4 + 3 * mmsize] + pmulhrsw m0, m3 + packuswb m1, m2 + packuswb m4, m0 + + vperm2i128 m2, m1, m4, 00100000b + vperm2i128 m1, m1, m4, 00110001b + punpcklbw m4, m2, m1 + punpckhbw m2, m1 + punpcklwd m1, m4, m2 + punpckhwd m4, m2 + mova m0, [trans8_shuf] + vpermd m1, m0, m1 + vpermd m4, m0, m4 + + lea r3, [3 * r1] + movq [r0], xm1 + movhps [r0 + r1], xm1 + vextracti128 xm2, m1, 1 + movq [r0 + 2 * r1], xm2 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + movq [r0], xm4 + movhps [r0 + r1], xm4 + vextracti128 xm2, m4, 1 + movq [r0 + 2 * r1], xm2 + movhps [r0 + r3], xm2 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_16, 3, 6, 6 + mova m3, [pw_1024] + movu xm5, [r2 + 16] + pinsrb xm5, [r2], 0 + lea r5, [intra_pred_shuff_0_8] + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 2], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + + lea r4, [c_ang8_mode_20] + pmaddubsw m1, m0, [r4] + pmulhrsw m1, m3 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 3], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m2, m0, [r4 + mmsize] + pmulhrsw m2, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 5], 0 + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + pmaddubsw m4, m0, [r4 + 2 * mmsize] + pmulhrsw m4, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 6], 0 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 8], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m0, [r4 + 3 * 
mmsize] + pmulhrsw m0, m3 + + packuswb m1, m2 + packuswb m4, m0 + + vperm2i128 m2, m1, m4, 00100000b + vperm2i128 m1, m1, m4, 00110001b + punpcklbw m4, m2, m1 + punpckhbw m2, m1 + punpcklwd m1, m4, m2 + punpckhwd m4, m2 + mova m0, [trans8_shuf] + vpermd m1, m0, m1 + vpermd m4, m0, m4 + + lea r3, [3 * r1] + movq [r0], xm1 + movhps [r0 + r1], xm1 + vextracti128 xm2, m1, 1 + movq [r0 + 2 * r1], xm2 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + movq [r0], xm4 + movhps [r0 + r1], xm4 + vextracti128 xm2, m4, 1 + movq [r0 + 2 * r1], xm2 + movhps [r0 + r3], xm2 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_20, 3, 6, 6 + mova m3, [pw_1024] + movu xm5, [r2] + lea r5, [intra_pred_shuff_0_8] + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 2 + 16], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + + lea r4, [c_ang8_mode_20] + pmaddubsw m1, m0, [r4] + pmulhrsw m1, m3 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 3 + 16], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m2, m0, [r4 + mmsize] + pmulhrsw m2, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 5 + 16], 0 + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + pmaddubsw m4, m0, [r4 + 2 * mmsize] + pmulhrsw m4, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 6 + 16], 0 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 8 + 16], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m0, [r4 + 3 * mmsize] + pmulhrsw m0, m3 + + packuswb m1, m2 + packuswb m4, m0 + + lea r3, [3 * r1] + movq [r0], xm1 + vextracti128 xm2, m1, 1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm1 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + movq [r0], xm4 + vextracti128 xm2, m4, 1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm4 + movhps [r0 + r3], xm2 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_21, 3, 6, 6 + mova m3, [pw_1024] + movu xm5, [r2] + lea r5, [intra_pred_shuff_0_8] + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 2 + 16], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + + lea r4, [c_ang8_mode_15] + pmaddubsw m1, m0, [r4] + pmulhrsw m1, m3 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 4 + 16], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m2, m0, [r4 + mmsize] + pmulhrsw m2, m3 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 6 + 16], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m4, m0, [r4 + 2 * mmsize] + pmulhrsw m4, m3 + mova xm0, xm5 + pslldq xm5, 1 + pinsrb xm5, [r2 + 8 + 16], 0 + vinserti128 m0, m0, xm5, 1 + pshufb m0, [r5] + pmaddubsw m0, [r4 + 3 * mmsize] + pmulhrsw m0, m3 + packuswb m1, m2 + packuswb m4, m0 + + lea r3, [3 * r1] + movq [r0], xm1 + vextracti128 xm2, m1, 1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm1 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + movq [r0], xm4 + vextracti128 xm2, m4, 1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm4 + movhps [r0 + r3], xm2 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_22, 3, 6, 6 + mova m3, [pw_1024] + movu xm5, [r2] + lea r5, [intra_pred_shuff_0_8] + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + + lea r4, [c_ang8_mode_14] + pmaddubsw m1, m0, [r4] + pmulhrsw m1, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 2 + 16], 0 + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + pmaddubsw m2, m0, [r4 + mmsize] + pmulhrsw m2, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 5 + 16], 0 + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + pmaddubsw m4, m0, [r4 + 2 * mmsize] + pmulhrsw m4, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 7 + 16], 0 + pshufb xm5, [r5] + vinserti128 m0, m0, xm5, 1 + pmaddubsw m0, [r4 + 3 * mmsize] + pmulhrsw m0, m3 + packuswb m1, m2 + packuswb 
m4, m0 + + lea r3, [3 * r1] + movq [r0], xm1 + vextracti128 xm2, m1, 1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm1 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + movq [r0], xm4 + vextracti128 xm2, m4, 1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm4 + movhps [r0 + r3], xm2 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_14, 3, 6, 6 + mova m3, [pw_1024] + movu xm5, [r2 + 16] + pinsrb xm5, [r2], 0 + lea r5, [intra_pred_shuff_0_8] + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + + lea r4, [c_ang8_mode_14] + pmaddubsw m1, m0, [r4] + pmulhrsw m1, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 2], 0 + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + pmaddubsw m2, m0, [r4 + mmsize] + pmulhrsw m2, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 5], 0 + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + pmaddubsw m4, m0, [r4 + 2 * mmsize] + pmulhrsw m4, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 7], 0 + pshufb xm5, [r5] + vinserti128 m0, m0, xm5, 1 + pmaddubsw m0, [r4 + 3 * mmsize] + pmulhrsw m0, m3 + packuswb m1, m2 + packuswb m4, m0 + + vperm2i128 m2, m1, m4, 00100000b + vperm2i128 m1, m1, m4, 00110001b + punpcklbw m4, m2, m1 + punpckhbw m2, m1 + punpcklwd m1, m4, m2 + punpckhwd m4, m2 + mova m0, [trans8_shuf] + vpermd m1, m0, m1 + vpermd m4, m0, m4 + + lea r3, [3 * r1] + movq [r0], xm1 + movhps [r0 + r1], xm1 + vextracti128 xm2, m1, 1 + movq [r0 + 2 * r1], xm2 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + movq [r0], xm4 + movhps [r0 + r1], xm4 + vextracti128 xm2, m4, 1 + movq [r0 + 2 * r1], xm2 + movhps [r0 + r3], xm2 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_13, 3, 6, 6 + mova m3, [pw_1024] + movu xm5, [r2 + 16] + pinsrb xm5, [r2], 0 + lea r5, [intra_pred_shuff_0_8] + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + + lea r4, [c_ang8_mode_13] + pmaddubsw m1, m0, [r4] + pmulhrsw m1, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 4], 0 + pshufb xm4, xm5, [r5] + vinserti128 m0, m0, xm4, 1 + pmaddubsw m2, m0, [r4 + mmsize] + pmulhrsw m2, m3 + vinserti128 m0, m0, xm4, 0 + pmaddubsw m4, m0, [r4 + 2 * mmsize] + pmulhrsw m4, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 7], 0 + pshufb xm5, [r5] + vinserti128 m0, m0, xm5, 1 + pmaddubsw m0, [r4 + 3 * mmsize] + pmulhrsw m0, m3 + packuswb m1, m2 + packuswb m4, m0 + + vperm2i128 m2, m1, m4, 00100000b + vperm2i128 m1, m1, m4, 00110001b + punpcklbw m4, m2, m1 + punpckhbw m2, m1 + punpcklwd m1, m4, m2 + punpckhwd m4, m2 + mova m0, [trans8_shuf] + vpermd m1, m0, m1 + vpermd m4, m0, m4 + + lea r3, [3 * r1] + movq [r0], xm1 + movhps [r0 + r1], xm1 + vextracti128 xm2, m1, 1 + movq [r0 + 2 * r1], xm2 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + movq [r0], xm4 + movhps [r0 + r1], xm4 + vextracti128 xm2, m4, 1 + movq [r0 + 2 * r1], xm2 + movhps [r0 + r3], xm2 + RET + + +INIT_YMM avx2 +cglobal intra_pred_ang8_23, 3, 6, 6 + mova m3, [pw_1024] + movu xm5, [r2] + lea r5, [intra_pred_shuff_0_8] + vinserti128 m0, m5, xm5, 1 + pshufb m0, [r5] + + lea r4, [c_ang8_mode_13] + pmaddubsw m1, m0, [r4] + pmulhrsw m1, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 4 + 16], 0 + pshufb xm4, xm5, [r5] + vinserti128 m0, m0, xm4, 1 + pmaddubsw m2, m0, [r4 + mmsize] + pmulhrsw m2, m3 + vinserti128 m0, m0, xm4, 0 + pmaddubsw m4, m0, [r4 + 2 * mmsize] + pmulhrsw m4, m3 + pslldq xm5, 1 + pinsrb xm5, [r2 + 7 + 16], 0 + pshufb xm5, [r5] + vinserti128 m0, m0, xm5, 1 + pmaddubsw m0, [r4 + 3 * mmsize] + pmulhrsw m0, m3 + + packuswb m1, m2 + packuswb m4, m0 + + lea r3, [3 * r1] + movq [r0], xm1 + vextracti128 xm2, m1, 1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm1 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + 
movq [r0], xm4 + vextracti128 xm2, m4, 1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm4 + movhps [r0 + r3], xm2 + RET INIT_YMM avx2 cglobal intra_pred_ang8_12, 3, 5, 5 @@ -11228,6 +12546,849 @@ movu [%2], xm3 %endmacro +%if ARCH_X86_64 == 1 +%macro INTRA_PRED_TRANS_STORE_16x16 0 + punpcklbw m8, m0, m1 + punpckhbw m0, m1 + + punpcklbw m1, m2, m3 + punpckhbw m2, m3 + + punpcklbw m3, m4, m5 + punpckhbw m4, m5 + + punpcklbw m5, m6, m7 + punpckhbw m6, m7 + + punpcklwd m7, m8, m1 + punpckhwd m8, m1 + + punpcklwd m1, m3, m5 + punpckhwd m3, m5 + + punpcklwd m5, m0, m2 + punpckhwd m0, m2 + + punpcklwd m2, m4, m6 + punpckhwd m4, m6 + + punpckldq m6, m7, m1 + punpckhdq m7, m1 + + punpckldq m1, m8, m3 + punpckhdq m8, m3 + + punpckldq m3, m5, m2 + punpckhdq m5, m2 + + punpckldq m2, m0, m4 + punpckhdq m0, m4 + + vpermq m6, m6, 0xD8 + vpermq m7, m7, 0xD8 + vpermq m1, m1, 0xD8 + vpermq m8, m8, 0xD8 + vpermq m3, m3, 0xD8 + vpermq m5, m5, 0xD8 + vpermq m2, m2, 0xD8 + vpermq m0, m0, 0xD8 + + movu [r0], xm6 + vextracti128 xm4, m6, 1 + movu [r0 + r1], xm4 + + movu [r0 + 2 * r1], xm7 + vextracti128 xm4, m7, 1 + movu [r0 + r3], xm4 + + lea r0, [r0 + 4 * r1] + + movu [r0], xm1 + vextracti128 xm4, m1, 1 + movu [r0 + r1], xm4 + + movu [r0 + 2 * r1], xm8 + vextracti128 xm4, m8, 1 + movu [r0 + r3], xm4 + + lea r0, [r0 + 4 * r1] + + movu [r0], xm3 + vextracti128 xm4, m3, 1 + movu [r0 + r1], xm4 + + movu [r0 + 2 * r1], xm5 + vextracti128 xm4, m5, 1 + movu [r0 + r3], xm4 + + lea r0, [r0 + 4 * r1] + + movu [r0], xm2 + vextracti128 xm4, m2, 1 + movu [r0 + r1], xm4 + + movu [r0 + 2 * r1], xm0 + vextracti128 xm4, m0, 1 + movu [r0 + r3], xm4 +%endmacro + +%macro INTRA_PRED_ANG16_CAL_ROW 3 + pmaddubsw %1, m9, [r4 + (%3 * mmsize)] + pmulhrsw %1, m11 + pmaddubsw %2, m10, [r4 + (%3 * mmsize)] + pmulhrsw %2, m11 + packuswb %1, %2 +%endmacro + + +INIT_YMM avx2 +cglobal intra_pred_ang16_12, 3, 6, 13 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + movu xm9, [r2 + 32] + pinsrb xm9, [r2], 0 + pslldq xm7, xm9, 1 + pinsrb xm7, [r2 + 6], 0 + vinserti128 m9, m9, xm7, 1 + pshufb m9, [r5] + + movu xm12, [r2 + 6 + 32] + + psrldq xm10, xm12, 2 + psrldq xm8, xm12, 1 + vinserti128 m10, m10, xm8, 1 + pshufb m10, [r5] + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_12] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + pslldq xm7, 1 + pinsrb xm7, [r2 + 13], 0 + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + mova xm8, xm12 + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + + movu xm9, [r2 + 31] + pinsrb xm9, [r2 + 6], 0 + pinsrb xm9, [r2 + 0], 1 + pshufb xm9, [r5] + vinserti128 m9, m9, xm7, 1 + + psrldq xm10, xm12, 1 + vinserti128 m10, m10, xm12, 1 + pshufb m10, [r5] + + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_13, 3, 6, 14 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + movu xm13, [r2 + 32] + pinsrb xm13, [r2], 0 + pslldq xm7, xm13, 2 + pinsrb xm7, [r2 + 7], 0 + pinsrb xm7, [r2 + 4], 1 + vinserti128 m9, m13, xm7, 1 + pshufb m9, [r5] + + movu xm12, [r2 + 4 + 32] + + psrldq xm10, xm12, 4 + psrldq xm8, xm12, 2 + vinserti128 m10, m10, xm8, 1 + pshufb m10, [r5] + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_13] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + + 
pslldq xm7, 1 + pinsrb xm7, [r2 + 11], 0 + pshufb xm2, xm7, [r5] + vinserti128 m9, m9, xm2, 1 + + psrldq xm8, xm12, 1 + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + + pslldq xm13, 1 + pinsrb xm13, [r2 + 4], 0 + pshufb xm3, xm13, [r5] + vinserti128 m9, m9, xm3, 0 + + psrldq xm8, xm12, 3 + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + + pslldq xm7, 1 + pinsrb xm7, [r2 + 14], 0 + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + mova xm8, xm12 + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + + pslldq xm13, 1 + pinsrb xm13, [r2 + 7], 0 + pshufb xm13, [r5] + vinserti128 m9, m9, xm13, 0 + + psrldq xm12, 2 + pshufb xm12, [r5] + vinserti128 m10, m10, xm12, 0 + + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_11, 3, 5, 12 + mova m11, [pw_1024] + + movu xm9, [r2 + 32] + pinsrb xm9, [r2], 0 + pshufb xm9, [intra_pred_shuff_0_8] + vinserti128 m9, m9, xm9, 1 + + vbroadcasti128 m10, [r2 + 8 + 32] + pshufb m10, [intra_pred_shuff_0_8] + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_11] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + + +INIT_YMM avx2 +cglobal intra_pred_ang16_3, 3, 6, 12 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + movu xm9, [r2 + 1 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 9 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 8 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 16 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_3] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + + movu xm9, [r2 + 2 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 10 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 9 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 17 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + + movu xm7, [r2 + 3 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 0 + + movu xm8, [r2 + 11 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + + movu xm9, [r2 + 4 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 12 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 10 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 18 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + movu xm9, [r2 + 5 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 13 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 11 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 19 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + add r4, 4 * mmsize + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + + movu xm7, [r2 + 12 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 20 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + + movu xm9, [r2 + 6 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 14 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 13 + 32] + pshufb 
xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 21 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + + movu xm9, [r2 + 7 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 15 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 14 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 22 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + + +INIT_YMM avx2 +cglobal intra_pred_ang16_4, 3, 6, 12 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + movu xm9, [r2 + 1 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 9 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 6 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 14 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_4] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + + movu xm9, [r2 + 2 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 10 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 7 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 15 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + + movu xm7, [r2 + 8 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 16 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + + movu xm7, [r2 + 3 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 0 + + movu xm8, [r2 + 11 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + movu xm9, [r2 + 4 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 12 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 9 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 17 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + + movu xm7, [r2 + 10 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 18 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + + movu xm7, [r2 + 5 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 0 + + movu xm8, [r2 + 13 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + + movu xm9, [r2 + 6 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 14 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 11 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 19 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_5, 3, 6, 12 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + movu xm9, [r2 + 1 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 9 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 5 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 13 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_5] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + + movu xm9, [r2 + 2 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 10 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 6 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 14 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + + movu xm9, [r2 + 3 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 11 + 32] + pshufb xm10, 
[r5] + + movu xm7, [r2 + 7 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 15 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + + movu xm9, [r2 + 4 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 12 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 8 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 16 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + + movu xm9, [r2 + 5 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 13 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 9 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 17 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_6, 3, 6, 12 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + movu xm9, [r2 + 1 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 9 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 4 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 12 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_6] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + + movu xm7, [r2 + 5 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 13 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + + movu xm7, [r2 + 2 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 0 + + movu xm8, [r2 + 10 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + movu xm9, [r2 + 3 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 11 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 6 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 14 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + + movu xm7, [r2 + 7 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 15 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + + movu xm7, [r2 + 4 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 0 + + movu xm8, [r2 + 12 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_7, 3, 6, 12 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + movu xm9, [r2 + 1 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 9 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 3 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 11 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_7] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + + movu xm7, [r2 + 4 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 12 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + + movu xm7, [r2 + 2 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 0 + + movu xm8, [r2 + 10 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + 
INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + + movu xm7, [r2 + 5 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 13 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + + movu xm7, [r2 + 3 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 0 + + movu xm8, [r2 + 11 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_8, 3, 6, 12 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + movu xm9, [r2 + 1 + 32] + pshufb xm9, [r5] + movu xm10, [r2 + 9 + 32] + pshufb xm10, [r5] + + movu xm7, [r2 + 2 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm8, [r2 + 10 + 32] + pshufb xm8, [r5] + vinserti128 m10, m10, xm8, 1 + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_8] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + movu xm4, [r2 + 3 + 32] + pshufb xm4, [r5] + vinserti128 m9, m9, xm4, 1 + + movu xm5, [r2 + 11 + 32] + pshufb xm5, [r5] + vinserti128 m10, m10, xm5, 1 + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + + vinserti128 m9, m9, xm7, 0 + vinserti128 m10, m10, xm8, 0 + + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_9, 3, 6, 12 + mova m11, [pw_1024] + lea r5, [intra_pred_shuff_0_8] + + vbroadcasti128 m9, [r2 + 1 + 32] + pshufb m9, [r5] + vbroadcasti128 m10, [r2 + 9 + 32] + pshufb m10, [r5] + + lea r3, [3 * r1] + lea r4, [c_ang16_mode_9] + + INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 + INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 + INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + + add r4, 4 * mmsize + + INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 + INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 + INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + + movu xm7, [r2 + 2 + 32] + pshufb xm7, [r5] + vinserti128 m9, m9, xm7, 1 + + movu xm7, [r2 + 10 + 32] + pshufb xm7, [r5] + vinserti128 m10, m10, xm7, 1 + + INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + + ; transpose and store + INTRA_PRED_TRANS_STORE_16x16 + RET +%endif + INIT_YMM avx2 cglobal intra_pred_ang16_25, 3, 5, 5 mova m0, [pw_1024] @@ -13514,5 +15675,2154 @@ vpermq m6, m6, 11011000b movu [r0 + r3], m6 RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_33, 3, 5, 11 + mova m0, [pw_1024] + mova m1, [intra_pred_shuff_0_8] + lea r3, [3 * r1] + lea r4, [c_ang32_mode_33] + + ;row [0] + vbroadcasti128 m2, [r2 + 1] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 9] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 17] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 25] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 0 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 0 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0], m6 + + ;row [1] + vbroadcasti128 m2, [r2 + 2] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 10] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 18] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 26] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r1], m6 + + ;row [2] + vbroadcasti128 m2, [r2 + 3] + pshufb 
m2, m1 + vbroadcasti128 m3, [r2 + 11] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 19] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 27] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row [3] + vbroadcasti128 m2, [r2 + 4] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 12] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 20] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 28] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 3 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 3 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row [4, 5] + vbroadcasti128 m2, [r2 + 5] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 13] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 21] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 29] + pshufb m5, m1 + + add r4, 4 * mmsize + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row [6] + vbroadcasti128 m2, [r2 + 6] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 14] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 22] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 30] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row [7] + vbroadcasti128 m2, [r2 + 7] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 15] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 23] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 31] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row [8] + vbroadcasti128 m2, [r2 + 8] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 16] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 24] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 32] + pshufb m5, m1 + + lea r0, [r0 + 4 * r1] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 3 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 3 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0], m6 + + ;row [9, 10] + vbroadcasti128 m2, [r2 + 9] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 17] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 25] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 33] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row [11] + vbroadcasti128 m2, [r2 + 10] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 18] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 26] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 34] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row [12] + vbroadcasti128 m2, [r2 + 11] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 19] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 27] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 35] + pshufb m5, m1 + + lea r0, [r0 + 4 * r1] + vperm2i128 m6, 
m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0], m6 + + ;row [13] + vbroadcasti128 m2, [r2 + 12] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 20] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 28] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 36] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 3 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 3 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r1], m6 + + ;row [14] + vbroadcasti128 m2, [r2 + 13] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 21] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 29] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 37] + pshufb m5, m1 + + add r4, 4 * mmsize + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 0 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 0 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row [15, 16] + vbroadcasti128 m2, [r2 + 14] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 22] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 30] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 38] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row [17] + vbroadcasti128 m2, [r2 + 15] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 23] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 31] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 39] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r1], m6 + + ;row [18] + vbroadcasti128 m2, [r2 + 16] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 24] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 32] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 40] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 3 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 3 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row [19] + vbroadcasti128 m2, [r2 + 17] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 25] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 33] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 41] + pshufb m5, m1 + + add r4, 4 * mmsize + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 0 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 0 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row [20, 21] + vbroadcasti128 m2, [r2 + 18] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 26] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 34] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 42] + pshufb m5, m1 + + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row [22] + vbroadcasti128 m2, [r2 + 19] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 27] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 35] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 43] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, 
m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row [23] + vbroadcasti128 m2, [r2 + 20] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 28] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 36] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 44] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 3 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 3 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row [24] + vbroadcasti128 m2, [r2 + 21] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 29] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 37] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 45] + pshufb m5, m1 + + add r4, 4 * mmsize + lea r0, [r0 + 4 * r1] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 0 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 0 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0], m6 + + ;row [25, 26] + vbroadcasti128 m2, [r2 + 22] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 30] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 38] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 46] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row [27] + vbroadcasti128 m2, [r2 + 23] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 31] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 39] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 47] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row [28] + vbroadcasti128 m2, [r2 + 24] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 32] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 40] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 48] + pshufb m5, m1 + + lea r0, [r0 + 4 * r1] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 3 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 3 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0], m6 + + ;row [29] + vbroadcasti128 m2, [r2 + 25] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 33] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 41] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 49] + pshufb m5, m1 + + add r4, 4 * mmsize + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 0 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 0 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r1], m6 + + ;row [30] + vbroadcasti128 m2, [r2 + 26] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 34] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 42] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 50] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row [31] + vbroadcasti128 m2, [r2 + 27] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 35] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 43] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 51] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + RET + 
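+; Reviewer sketch, not part of the upstream patch: the fractional-angle
+; kernels in this block all vectorise the same scalar HEVC interpolation.
+; A minimal C model for the vertical positive-angle modes, assuming ref[]
+; holds the filtered neighbours and angle the mode's intraPredAngle (26 for
+; the mode-33 routine above, so the whole-pel offset advances by 0 or 1
+; each row, matching the r2 + 1 ... r2 + 27 loads it performs):
+;
+;     for (int y = 0; y < size; y++) {
+;         int off   = ((y + 1) * angle) >> 5;   /* whole-pel step    */
+;         int fract = ((y + 1) * angle) & 31;   /* 1/32-pel fraction */
+;         for (int x = 0; x < size; x++)
+;             dst[y * stride + x] =
+;                 ((32 - fract) * ref[x + off + 1] +
+;                  fract        * ref[x + off + 2] + 16) >> 5;
+;     }
+;
+; (horizontal modes swap the roles of x and y). The pmaddubsw/pmulhrsw
+; pairs evaluate ((32 - fract) * a + fract * b + 16) >> 5 for 16 output
+; pixels per ymm register: pmaddubsw forms the two-tap weighted word sums
+; and pmulhrsw against pw_1024 is the rounded shift by 5 (1024 = 32768/32).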
+INIT_YMM avx2 +cglobal intra_pred_ang32_25, 3, 5, 11 + mova m0, [pw_1024] + mova m1, [intra_pred_shuff_0_8] + lea r3, [3 * r1] + lea r4, [c_ang32_mode_25] + + ;row [0, 1] + vbroadcasti128 m2, [r2 + 0] + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 8] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 16] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 24] + pshufb m5, m1 + + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[2, 3] + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[4, 5] + mova m10, [r4 + 2 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[6, 7] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[8, 9] + add r4, 4 * mmsize + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[10, 11] + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[12, 13] + mova m10, [r4 + 2 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[14, 15] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[16, 17] + movu xm2, [r2 - 1] + pinsrb xm2, [r2 + 80], 0 + vinserti128 m2, m2, xm2, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 7] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 15] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 23] + pshufb m5, m1 + + add r4, 4 * mmsize + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[18, 19] + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[20, 21] + mova m10, [r4 + 2 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[22, 23] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[24, 25] + add r4, 4 * mmsize + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[26, 27] + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[28, 29] + mova m10, [r4 + 2 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[30, 31] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_24, 3, 5, 12 + mova m0, [pw_1024] + mova m1, [intra_pred_shuff_0_8] + lea r3, [3 * r1] + lea r4, [c_ang32_mode_24] + + ;row[0, 1] + vbroadcasti128 m11, [r2 + 0] + pshufb m2, m11, m1 + vbroadcasti128 m3, [r2 + 8] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 16] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 24] + pshufb m5, m1 + + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[2, 3] + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[4, 5] + mova m10, [r4 + 2 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[6, 7] + pslldq xm11, 1 + pinsrb xm11, [r2 + 70], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 7] + pshufb m3, 
m1 + vbroadcasti128 m4, [r2 + 15] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 23] + pshufb m5, m1 + + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[8, 9] + add r4, 4 * mmsize + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[10, 11] + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[12, 13] + pslldq xm11, 1 + pinsrb xm11, [r2 + 77], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 6] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 14] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 22] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[14, 15] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[16, 17] + add r4, 4 * mmsize + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[18] + mova m10, [r4 + 1 * mmsize] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, m10 + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, m10 + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row[19, 20] + pslldq xm11, 1 + pinsrb xm11, [r2 + 83], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 5] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 13] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 21] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[21, 22] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[23, 24] + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[25, 26] + pslldq xm11, 1 + pinsrb xm11, [r2 + 90], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 4] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 12] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 20] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[27, 28] + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[29, 30] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;[row 31] + mova m10, [r4 + 4 * mmsize] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, m10 + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, m10 + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_23, 3, 5, 12 + mova m0, [pw_1024] + mova m1, [intra_pred_shuff_0_8] + lea r3, [3 * r1] + lea r4, [c_ang32_mode_23] + + ;row[0, 1] + vbroadcasti128 m11, [r2 + 0] + pshufb m2, m11, m1 + vbroadcasti128 m3, [r2 + 8] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 16] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 24] + pshufb m5, m1 + + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[2] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + 
pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row[3, 4] + pslldq xm11, 1 + pinsrb xm11, [r2 + 68], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 7] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 15] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 23] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[5, 6] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[7, 8] + pslldq xm11, 1 + pinsrb xm11, [r2 + 71], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 6] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 14] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 22] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[9] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r1], m6 + + ;row[10, 11] + pslldq xm11, 1 + pinsrb xm11, [r2 + 75], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 5] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 13] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 21] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[12, 13] + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[14, 15] + pslldq xm11, 1 + pinsrb xm11, [r2 + 78], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 4] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 12] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 20] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[16] + lea r0, [r0 + 4 * r1] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0], m6 + + ;row[17, 18] + pslldq xm11, 1 + pinsrb xm11, [r2 + 82], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 3] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 11] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 19] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[19, 20] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[21, 22] + pslldq xm11, 1 + pinsrb xm11, [r2 + 85], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 2] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 10] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 18] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[23] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row[24, 25] + pslldq xm11, 1 + pinsrb xm11, [r2 + 89], 0 + 
vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 1] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 9] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 17] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[26, 27] + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[28, 29] + pslldq xm11, 1 + pinsrb xm11, [r2 + 92], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 0] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 8] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 16] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[30, 31] + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_22, 3, 5, 13 + mova m0, [pw_1024] + mova m1, [intra_pred_shuff_0_8] + lea r3, [3 * r1] + lea r4, [c_ang32_mode_22] + + ;row[0, 1] + vbroadcasti128 m11, [r2 + 0] + pshufb m2, m11, m1 + vbroadcasti128 m3, [r2 + 8] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 16] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 24] + pshufb m5, m1 + + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[2, 3] + pslldq xm11, 1 + pinsrb xm11, [r2 + 66], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 7] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 15] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 23] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[4, 5] + pslldq xm11, 1 + pinsrb xm11, [r2 + 69], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 6] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 14] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 22] + pshufb m5, m1 + + lea r0, [r0 + 4 * r1] + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[6] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 3 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 3 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row[7, 8] + pslldq xm11, 1 + pinsrb xm11, [r2 + 71], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 5] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 13] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 21] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[9, 10] + pslldq xm11, 1 + pinsrb xm11, [r2 + 74], 0 + vinserti128 m2, m11, xm11, 1 + vinserti128 m2, m2, xm2, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 4] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 12] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 20] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[11] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row[12, 13] + pslldq xm11, 1 + pinsrb xm11, [r2 + 76], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 3] 
+ pshufb m3, m1 + vbroadcasti128 m4, [r2 + 11] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 19] + pshufb m5, m1 + + mova m10, [r4 + 3 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[14, 15] + pslldq xm11, 1 + pinsrb xm11, [r2 + 79], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 2] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 10] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 18] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[16] + lea r0, [r0 + 4 * r1] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 1 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 1 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0], m6 + + ;row[17, 18] + pslldq xm11, 1 + pinsrb xm11, [r2 + 81], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 1] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 9] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 17] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[19, 20] + pslldq xm11, 1 + pinsrb xm11, [r2 + 84], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m12, [r2 + 0] + pshufb m3, m12, m1 + vbroadcasti128 m4, [r2 + 8] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 16] + pshufb m5, m1 + + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[21] + add r4, 4 * mmsize + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 0 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 0 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r1], m6 + + ;row[22, 23] + pslldq xm11, 1 + pinsrb xm11, [r2 + 86], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 66], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 7] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 15] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[24, 25] + pslldq xm11, 1 + pinsrb xm11, [r2 + 89], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 69], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 6] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 14] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + lea r0, [r0 + 4 * r1] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[26] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 3 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 3 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + 2 * r1], m6 + + ;row[27, 28] + pslldq xm11, 1 + pinsrb xm11, [r2 + 91], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 71], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 5] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 13] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[29, 30] + pslldq xm11, 1 + pinsrb xm11, [r2 + 94], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 74], 0 + 
vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 4] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 12] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[31] + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 2 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 2 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_21, 3, 5, 13 + mova m0, [pw_1024] + mova m1, [intra_pred_shuff_0_8] + lea r3, [3 * r1] + lea r4, [c_ang32_mode_21] + + ;row[0] + vbroadcasti128 m11, [r2 + 0] + pshufb m2, m11, m1 + vbroadcasti128 m3, [r2 + 8] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 16] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 24] + pshufb m5, m1 + + vperm2i128 m6, m2, m3, 00100000b + pmaddubsw m6, [r4 + 0 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 0 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0], m6 + + ;row[1, 2] + pslldq xm11, 1 + pinsrb xm11, [r2 + 66], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 7] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 15] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 23] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[3, 4] + pslldq xm11, 1 + pinsrb xm11, [r2 + 68], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 6] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 14] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 22] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[5, 6] + pslldq xm11, 1 + pinsrb xm11, [r2 + 70], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 5] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 13] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 21] + pshufb m5, m1 + + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[7, 8] + pslldq xm11, 1 + pinsrb xm11, [r2 + 72], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 4] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 12] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 20] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[9, 10] + pslldq xm11, 1 + pinsrb xm11, [r2 + 73], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 3] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 11] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 19] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[11, 12] + pslldq xm11, 1 + pinsrb xm11, [r2 + 75], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 2] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 10] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 18] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r3], m7 + lea r0, [r0 + 4 * r1] + movu [r0], m6 + + ;row[13, 14] + pslldq xm11, 1 + pinsrb xm11, [r2 + 77], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m3, [r2 + 1] + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 9] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 17] + pshufb m5, m1 + + mova 
m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + r1], m7 + movu [r0 + 2 * r1], m6 + + ;row[15] + pslldq xm11, 1 + pinsrb xm11, [r2 + 79], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + vbroadcasti128 m12, [r2 + 0] + pshufb m3, m12, m1 + vbroadcasti128 m4, [r2 + 8] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 16] + pshufb m5, m1 + vperm2i128 m6, m2, m3, 00100000b + add r4, 4 * mmsize + pmaddubsw m6, [r4 + 0 * mmsize] + pmulhrsw m6, m0 + vperm2i128 m7, m4, m5, 00100000b + pmaddubsw m7, [r4 + 0 * mmsize] + pmulhrsw m7, m0 + packuswb m6, m7 + vpermq m6, m6, 11011000b + movu [r0 + r3], m6 + + ;row[16, 17] + pslldq xm11, 1 + pinsrb xm11, [r2 + 81], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 66], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 7] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 15] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + lea r0, [r0 + 4 * r1] + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[18, 19] + pslldq xm11, 1 + pinsrb xm11, [r2 + 83], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 68], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 6] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 14] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[20, 21] + pslldq xm11, 1 + pinsrb xm11, [r2 + 85], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 70], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 5] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 13] + pshufb m5, m1 + + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + lea r0, [r0 + 4 * r1] + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[22, 23] + pslldq xm11, 1 + pinsrb xm11, [r2 + 87], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 72], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 4] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 12] + pshufb m5, m1 + + add r4, 4 * mmsize + mova m10, [r4 + 0 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[24, 25] + pslldq xm11, 1 + pinsrb xm11, [r2 + 88], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 73], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 3] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 11] + pshufb m5, m1 + + mova m10, [r4 + 1 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + lea r0, [r0 + 4 * r1] + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[26, 27] + pslldq xm11, 1 + pinsrb xm11, [r2 + 90], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 75], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 2] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 10] + pshufb m5, m1 + + mova m10, [r4 + 2 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + + ;row[28, 29] + pslldq xm11, 1 + pinsrb xm11, [r2 + 92], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 77], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 1] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 9] + pshufb m5, m1 + + mova m10, [r4 + 3 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + lea r0, [r0 + 4 * r1] + movu [r0], m7 + movu [r0 + r1], m6 + + ;row[30, 31] + pslldq xm11, 1 + pinsrb xm11, [r2 + 
94], 0 + vinserti128 m2, m11, xm11, 1 + pshufb m2, m1 + pslldq xm12, 1 + pinsrb xm12, [r2 + 79], 0 + vinserti128 m3, m12, xm12, 1 + pshufb m3, m1 + vbroadcasti128 m4, [r2 + 0] + pshufb m4, m1 + vbroadcasti128 m5, [r2 + 8] + pshufb m5, m1 + + mova m10, [r4 + 4 * mmsize] + + INTRA_PRED_ANG32_CAL_ROW + movu [r0 + 2 * r1], m7 + movu [r0 + r3], m6 + RET %endif +%macro INTRA_PRED_STORE_4x4 0 + movd [r0], xm0 + pextrd [r0 + r1], xm0, 1 + vextracti128 xm0, m0, 1 + lea r0, [r0 + 2 * r1] + movd [r0], xm0 + pextrd [r0 + r1], xm0, 1 +%endmacro + +%macro INTRA_PRED_TRANS_STORE_4x4 0 + vpermq m0, m0, 00001000b + pshufb m0, [c_trans_4x4] + + ;store + movd [r0], xm0 + pextrd [r0 + r1], xm0, 1 + lea r0, [r0 + 2 * r1] + pextrd [r0], xm0, 2 + pextrd [r0 + r1], xm0, 3 +%endmacro + +INIT_YMM avx2 +cglobal intra_pred_ang4_27, 3, 3, 1 + vbroadcasti128 m0, [r2 + 1] + pshufb m0, [intra_pred_shuff_0_4] + pmaddubsw m0, [c_ang4_mode_27] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_28, 3, 3, 1 + vbroadcasti128 m0, [r2 + 1] + pshufb m0, [intra_pred_shuff_0_4] + pmaddubsw m0, [c_ang4_mode_28] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_29, 3, 3, 1 + vbroadcasti128 m0, [r2 + 1] + pshufb m0, [intra_pred4_shuff1] + pmaddubsw m0, [c_ang4_mode_29] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_30, 3, 3, 1 + vbroadcasti128 m0, [r2 + 1] + pshufb m0, [intra_pred4_shuff2] + pmaddubsw m0, [c_ang4_mode_30] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_31, 3, 3, 1 + vbroadcasti128 m0, [r2 + 1] + pshufb m0, [intra_pred4_shuff31] + pmaddubsw m0, [c_ang4_mode_31] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_32, 3, 3, 1 + vbroadcasti128 m0, [r2 + 1] + pshufb m0, [intra_pred4_shuff31] + pmaddubsw m0, [c_ang4_mode_32] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_33, 3, 3, 1 + vbroadcasti128 m0, [r2 + 1] + pshufb m0, [intra_pred4_shuff33] + pmaddubsw m0, [c_ang4_mode_33] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + + +INIT_YMM avx2 +cglobal intra_pred_ang4_3, 3, 3, 1 + vbroadcasti128 m0, [r2 + 1] + pshufb m0, [intra_pred4_shuff3] + pmaddubsw m0, [c_ang4_mode_33] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_4, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff5] + pmaddubsw m0, [c_ang4_mode_32] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_5, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff5] + pmaddubsw m0, [c_ang4_mode_5] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_6, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff6] + pmaddubsw m0, [c_ang4_mode_6] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_7, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff7] + pmaddubsw m0, [c_ang4_mode_7] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_8, 3, 3, 1 + vbroadcasti128 
m0, [r2] + pshufb m0, [intra_pred4_shuff9] + pmaddubsw m0, [c_ang4_mode_8] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_9, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff9] + pmaddubsw m0, [c_ang4_mode_9] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_11, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff12] + pmaddubsw m0, [c_ang4_mode_11] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_12, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff12] + pmaddubsw m0, [c_ang4_mode_12] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_13, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff13] + pmaddubsw m0, [c_ang4_mode_13] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_14, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff14] + pmaddubsw m0, [c_ang4_mode_14] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_15, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff15] + pmaddubsw m0, [c_ang4_mode_15] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_16, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff16] + pmaddubsw m0, [c_ang4_mode_16] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_17, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff17] + pmaddubsw m0, [c_ang4_mode_17] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_TRANS_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_19, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff19] + pmaddubsw m0, [c_ang4_mode_19] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_20, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff20] + pmaddubsw m0, [c_ang4_mode_20] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_21, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff21] + pmaddubsw m0, [c_ang4_mode_21] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_22, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff22] + pmaddubsw m0, [c_ang4_mode_22] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_23, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred4_shuff23] + pmaddubsw m0, [c_ang4_mode_23] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_24, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred_shuff_0_4] + pmaddubsw m0, [c_ang4_mode_24] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang4_25, 3, 3, 1 + vbroadcasti128 m0, [r2] + pshufb m0, [intra_pred_shuff_0_4] + pmaddubsw m0, [c_ang4_mode_25] + pmulhrsw m0, [pw_1024] + packuswb m0, m0 + + INTRA_PRED_STORE_4x4 + RET
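
The avx2 kernels in this hunk, both the intra_pred_ang32_* functions and the intra_pred_ang4_* family above, all reduce to the same two-tap HEVC angular interpolation: the coefficient tables hold byte pairs (32 - f, f) that sum to 32, pmaddubsw forms the weighted sum of adjacent reference pixels, and pmulhrsw against pw_1024 is a fixed-point form of the (sum + 16) >> 5 rounding, with packuswb saturating the result to 8 bits. A scalar C sketch of what one lane computes; angPredPixel is an illustrative name, not an x265 function:

    #include <stdint.h>

    /* One lane of the pmaddubsw / pmulhrsw(pw_1024) idiom used above.
     * pmulhrsw computes (sum * 1024 * 2 + 0x8000) >> 16, which for the
     * non-negative sums produced here equals (sum + 16) >> 5, the HEVC
     * angular-prediction rounding. */
    static inline uint8_t angPredPixel(const uint8_t *ref, int i, int f)
    {
        int sum = (32 - f) * ref[i] + f * ref[i + 1]; /* pmaddubsw pair  */
        return (uint8_t)((sum + 16) >> 5);            /* pmulhrsw, pw_1024 */
    }
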
View file
x265_1.6.tar.gz/source/common/x86/intrapred8_allangs.asm -> x265_1.7.tar.gz/source/common/x86/intrapred8_allangs.asm
Changed
@@ -2,7 +2,7 @@ ;* Copyright (C) 2013 x265 project ;* ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com> -;* Praveen Tiwari <praveen@multicorewareinc.com> +;* Praveen Tiwari <praveen@multicorewareinc.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -27,6 +27,64 @@ SECTION_RODATA 32 +all_ang4_shuff: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 + db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7 + db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6 + db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5 + db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5 + db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 + db 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3 + db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12 + db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11 + db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11 + db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10 + db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10 + db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9 + db 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0, 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0 + db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1 + db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2 + db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2 + db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3 + db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3 + db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 + db 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4 + db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5 + db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6 + db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6 + db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7 + db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7, 4, 5, 5, 6, 6, 7, 7, 8 + db 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8 + +all_ang4: db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8 + db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20 + db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 
2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4 + db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20 + db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4 + db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20 + db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8 + db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24 + db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12 + db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28 + db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12 + db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28 + db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12 + db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24 + db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24 + db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12 + db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28 + db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12 + db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28 + db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12 + db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24 + db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8 + db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20 + db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4 + db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20 + db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4 + db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20 + db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8 + + SECTION .text ; global constant @@ -34,9 +92,14 @@ ; common constant with intrapred8.asm cextern ang_table +cextern pw_ang_table cextern tab_S1 cextern tab_S2 cextern tab_Si +cextern pw_16 +cextern pb_000000000000000F +cextern pb_0000000000000F0F +cextern pw_FFFFFFFFFFFFFFF0 ;----------------------------------------------------------------------------- @@ -23006,3 +23069,1098 @@ palignr m4, m2, m1, 
14 movu [r0 + 2111 * 16], m4 RET + + +;----------------------------------------------------------------------------- +; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma) +;----------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal all_angs_pred_4x4, 4, 4, 6 + + mova m5, [pw_1024] + lea r2, [all_ang4] + lea r3, [all_ang4_shuff] + +; mode 2 + + vbroadcasti128 m0, [r1 + 9] + mova xm1, xm0 + psrldq xm1, 1 + pshufb xm1, [r3] + movu [r0], xm1 + +; mode 3 + + pshufb m1, m0, [r3 + 1 * mmsize] + pmaddubsw m1, [r2] + pmulhrsw m1, m5 + +; mode 4 + + pshufb m2, m0, [r3 + 2 * mmsize] + pmaddubsw m2, [r2 + 1 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (3 - 2) * 16], m1 + +; mode 5 + + pshufb m1, m0, [r3 + 2 * mmsize] + pmaddubsw m1, [r2 + 2 * mmsize] + pmulhrsw m1, m5 + +; mode 6 + + pshufb m2, m0, [r3 + 3 * mmsize] + pmaddubsw m2, [r2 + 3 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (5 - 2) * 16], m1 + + add r3, 4 * mmsize + add r2, 4 * mmsize + +; mode 7 + + pshufb m1, m0, [r3 + 0 * mmsize] + pmaddubsw m1, [r2 + 0 * mmsize] + pmulhrsw m1, m5 + +; mode 8 + + pshufb m2, m0, [r3 + 1 * mmsize] + pmaddubsw m2, [r2 + 1 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (7 - 2) * 16], m1 + +; mode 9 + + pshufb m1, m0, [r3 + 1 * mmsize] + pmaddubsw m1, [r2 + 2 * mmsize] + pmulhrsw m1, m5 + packuswb m1, m1 + vpermq m1, m1, 11011000b + movu [r0 + (9 - 2) * 16], xm1 + +; mode 10 + + pshufb xm1, xm0, [r3 + 2 * mmsize] + movu [r0 + (10 - 2) * 16], xm1 + + pxor xm1, xm1 + movd xm2, [r1 + 1] + pshufd xm3, xm2, 0 + punpcklbw xm3, xm1 + pinsrb xm2, [r1], 0 + pshufb xm4, xm2, xm1 + punpcklbw xm4, xm1 + psubw xm3, xm4 + psraw xm3, 1 + pshufb xm4, xm0, xm1 + punpcklbw xm4, xm1 + paddw xm3, xm4 + packuswb xm3, xm1 + + pextrb [r0 + 128], xm3, 0 + pextrb [r0 + 132], xm3, 1 + pextrb [r0 + 136], xm3, 2 + pextrb [r0 + 140], xm3, 3 + +; mode 11 + + vbroadcasti128 m0, [r1] + pshufb m1, m0, [r3 + 3 * mmsize] + pmaddubsw m1, [r2 + 3 * mmsize] + pmulhrsw m1, m5 + +; mode 12 + + add r2, 4 * mmsize + + pshufb m2, m0, [r3 + 3 * mmsize] + pmaddubsw m2, [r2 + 0 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (11 - 2) * 16], m1 + +; mode 13 + + add r3, 4 * mmsize + + pshufb m1, m0, [r3 + 0 * mmsize] + pmaddubsw m1, [r2 + 1 * mmsize] + pmulhrsw m1, m5 + +; mode 14 + + pshufb m2, m0, [r3 + 1 * mmsize] + pmaddubsw m2, [r2 + 2 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (13 - 2) * 16], m1 + +; mode 15 + + pshufb m1, m0, [r3 + 2 * mmsize] + pmaddubsw m1, [r2 + 3 * mmsize] + pmulhrsw m1, m5 + +; mode 16 + + add r2, 4 * mmsize + + pshufb m2, m0, [r3 + 3 * mmsize] + pmaddubsw m2, [r2 + 0 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (15 - 2) * 16], m1 + +; mode 17 + + add r3, 4 * mmsize + + pshufb m1, m0, [r3 + 0 * mmsize] + pmaddubsw m1, [r2 + 1 * mmsize] + pmulhrsw m1, m5 + packuswb m1, m1 + vpermq m1, m1, 11011000b + +; mode 18 + + pshufb m2, m0, [r3 + 1 * mmsize] + vinserti128 m1, m1, xm2, 1 + movu [r0 + (17 - 2) * 16], m1 + +; mode 19 + + pshufb m1, m0, [r3 + 2 * mmsize] + pmaddubsw m1, [r2 + 2 * mmsize] + pmulhrsw m1, m5 + +; mode 20 + + pshufb m2, m0, [r3 + 3 * mmsize] + pmaddubsw m2, [r2 + 3 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (19 - 2) * 16], m1 + +; mode 21 + + add r2, 4 * mmsize + 
add r3, 4 * mmsize + + pshufb m1, m0, [r3 + 0 * mmsize] + pmaddubsw m1, [r2 + 0 * mmsize] + pmulhrsw m1, m5 + +; mode 22 + + pshufb m2, m0, [r3 + 1 * mmsize] + pmaddubsw m2, [r2 + 1 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (21 - 2) * 16], m1 + +; mode 23 + + pshufb m1, m0, [r3 + 2 * mmsize] + pmaddubsw m1, [r2 + 2 * mmsize] + pmulhrsw m1, m5 + +; mode 24 + + pshufb m2, m0, [r3 + 3 * mmsize] + pmaddubsw m2, [r2 + 3 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (23 - 2) * 16], m1 + +; mode 25 + + add r2, 4 * mmsize + + pshufb m1, m0, [r3 + 3 * mmsize] + pmaddubsw m1, [r2 + 0 * mmsize] + pmulhrsw m1, m5 + packuswb m1, m1 + vpermq m1, m1, 11011000b + movu [r0 + (25 - 2) * 16], xm1 + +; mode 26 + + add r3, 4 * mmsize + + pshufb xm1, xm0, [r3 + 0 * mmsize] + movu [r0 + (26 - 2) * 16], xm1 + + pxor xm1, xm1 + movd xm2, [r1 + 9] + pshufd xm3, xm2, 0 + punpcklbw xm3, xm1 + pinsrb xm4, [r1 + 0], 0 + pshufb xm4, xm1 + punpcklbw xm4, xm1 + psubw xm3, xm4 + psraw xm3, 1 + psrldq xm2, xm0, 1 + pshufb xm2, xm1 + punpcklbw xm2, xm1 + paddw xm3, xm2 + packuswb xm3, xm1 + + pextrb [r0 + 384], xm3, 0 + pextrb [r0 + 388], xm3, 1 + pextrb [r0 + 392], xm3, 2 + pextrb [r0 + 396], xm3, 3 + +; mode 27 + + pshufb m1, m0, [r3 + 1 * mmsize] + pmaddubsw m1, [r2 + 1 * mmsize] + pmulhrsw m1, m5 + +; mode 28 + + pshufb m2, m0, [r3 + 1 * mmsize] + pmaddubsw m2, [r2 + 2 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (27 - 2) * 16], m1 + +; mode 29 + + pshufb m1, m0, [r3 + 2 * mmsize] + pmaddubsw m1, [r2 + 3 * mmsize] + pmulhrsw m1, m5 + +; mode 30 + + add r2, 4 * mmsize + + pshufb m2, m0, [r3 + 3 * mmsize] + pmaddubsw m2, [r2 + 0 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (29 - 2) * 16], m1 + +; mode 31 + + add r3, 4 * mmsize + + pshufb m1, m0, [r3 + 0 * mmsize] + pmaddubsw m1, [r2 + 1 * mmsize] + pmulhrsw m1, m5 + +; mode 32 + + pshufb m2, m0, [r3 + 0 * mmsize] + pmaddubsw m2, [r2 + 2 * mmsize] + pmulhrsw m2, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r0 + (31 - 2) * 16], m1 + +; mode 33 + + pshufb m1, m0, [r3 + 1 * mmsize] + pmaddubsw m1, [r2 + 3 * mmsize] + pmulhrsw m1, m5 + packuswb m1, m2 + vpermq m1, m1, 11011000b + +; mode 34 + + pshufb m0, [r3 + 2 * mmsize] + vinserti128 m1, m1, xm0, 1 + movu [r0 + (33 - 2) * 16], m1 + RET + +;----------------------------------------------------------------------------- +; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma) +;----------------------------------------------------------------------------- +INIT_XMM sse2 +cglobal all_angs_pred_4x4, 4, 4, 8 + +; mode 2 + + movh m6, [r1 + 9] + mova m2, m6 + psrldq m2, 1 + movd [r0], m2 ;byte[A, B, C, D] + psrldq m2, 1 + movd [r0 + 4], m2 ;byte[B, C, D, E] + psrldq m2, 1 + movd [r0 + 8], m2 ;byte[C, D, E, F] + psrldq m2, 1 + movd [r0 + 12], m2 ;byte[D, E, F, G] + +; mode 10/26 + + pxor m7, m7 + pshufd m5, m6, 0 + mova [r0 + 128], m5 ;mode 10 byte[9, A, B, C, 9, A, B, C, 9, A, B, C, 9, A, B, C] + + movd m4, [r1 + 1] + pshufd m4, m4, 0 + mova [r0 + 384], m4 ;mode 26 byte[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4] + + movd m1, [r1] + punpcklbw m1, m7 + pshuflw m1, m1, 0x00 + punpcklqdq m1, m1 ;m1 = byte[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] + + punpckldq m4, m5 + punpcklbw m4, m7 ;m4 = word[1, 2, 3, 4, 9, A, B, C] + pshuflw m2, m4, 0x00 + pshufhw m2, m2, 0x00 ;m2 = word[1, 1, 1, 1, 9, 9, 9, 9] + + psubw m4, m1 + psraw m4, 
1 + + pshufd m2, m2, q1032 ;m2 = word[9, 9, 9, 9, 1, 1, 1, 1] + paddw m4, m2 + packuswb m4, m4 + +%if ARCH_X86_64 + movq r2, m4 + + mov [r0 + 128], r2b ;mode 10 + shr r2, 8 + mov [r0 + 132], r2b + shr r2, 8 + mov [r0 + 136], r2b + shr r2, 8 + mov [r0 + 140], r2b + shr r2, 8 + mov [r0 + 384], r2b ;mode 26 + shr r2d, 8 + mov [r0 + 388], r2b + shr r2d, 8 + mov [r0 + 392], r2b + shr r2d, 8 + mov [r0 + 396], r2b + +%else + movd r2d, m4 + + mov [r0 + 128], r2b ;mode 10 + shr r2d, 8 + mov [r0 + 132], r2b + shr r2d, 8 + mov [r0 + 136], r2b + shr r2d, 8 + mov [r0 + 140], r2b + + psrldq m4, 4 + movd r2d, m4 + + mov [r0 + 384], r2b ;mode 26 + shr r2d, 8 + mov [r0 + 388], r2b + shr r2d, 8 + mov [r0 + 392], r2b + shr r2d, 8 + mov [r0 + 396], r2b +%endif + +; mode 3 + + mova m2, [pw_16] + lea r3, [pw_ang_table + 7 * 16] + lea r2, [pw_ang_table + 23 * 16] + punpcklbw m6, m6 + psrldq m6, 1 + movh m1, m6 + psrldq m6, 2 + movh m0, m6 + psrldq m6, 2 + movh m3, m6 + psrldq m6, 2 + punpcklbw m1, m7 ;m1 = word[9, A, A, B, B, C, C, D] + punpcklbw m0, m7 ;m0 = word[A, B, B, C, C, D, D, E] + punpcklbw m3, m7 ;m3 = word[B, C, C, D, D, E, E, F] + punpcklbw m6, m7 ;m6 = word[C, D, D, E, E, F, F, G] + + mova m7, [r2 - 3 * 16] + + pmaddwd m5, m1, [r2 + 3 * 16] + pmaddwd m4, m0, m7 + + packssdw m5, m4 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m4, m3, [r3 + 7 * 16] + pmaddwd m6, [r3 + 1 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + packuswb m5, m4 + mova [r0 + 16], m5 + movd [r0 + 68], m5 ;mode 6 row 1 + psrldq m5, 4 + movd [r0 + 76], m5 ;mode 6 row 3 + +; mode 4 + + pmaddwd m4, m0, [r2 + 8 * 16] + pmaddwd m6, m3, m7 + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m5, m1, [r2 - 2 * 16] + pmaddwd m6, m0, [r3 + 3 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + packuswb m5, m4 + mova [r0 + 32], m5 + +; mode 5 + + pmaddwd m5, m1, [r2 - 6 * 16] + pmaddwd m6, m0, [r3 - 5 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m4, m0, [r2 - 4 * 16] + pmaddwd m3, [r3 - 3 * 16] + + packssdw m4, m3 + paddw m4, m2 + psraw m4, 5 + + packuswb m5, m4 + mova [r0 + 48], m5 + +; mode 6 + + pmaddwd m5, m1, [r3 + 6 * 16] + pmaddwd m6, m0, [r3 + 0 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + packuswb m5, m6 + movd [r0 + 64], m5 + psrldq m5, 4 + movd [r0 + 72], m5 + +; mode 7 + + pmaddwd m5, m1, [r3 + 2 * 16] + pmaddwd m6, m1, [r2 - 5 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + mova m3, [r2 + 4 * 16] + pmaddwd m4, m1, m3 + pmaddwd m0, [r3 - 3 * 16] + + packssdw m4, m0 + paddw m4, m2 + psraw m4, 5 + + packuswb m5, m4 + mova [r0 + 80], m5 + +; mode 8 + + mova m0, [r3 - 2 * 16] + pmaddwd m5, m1, m0 + pmaddwd m6, m1, [r3 + 3 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m4, m1, [r3 + 8 * 16] + pmaddwd m7, m1 + + packssdw m4, m7 + paddw m4, m2 + psraw m4, 5 + + packuswb m5, m4 + mova [r0 + 96], m5 + +; mode 9 + + pmaddwd m5, m1, [r3 - 5 * 16] + pmaddwd m6, m1, [r3 - 3 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m4, m1, [r3 - 1 * 16] + pmaddwd m6, m1, [r3 + 1 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + packuswb m5, m4 + mova [r0 + 112], m5 + +; mode 11 + + movd m5, [r1] + punpcklwd m5, m1 + pand m5, [pb_0000000000000F0F] + pslldq m1, 4 + por m1, m5 ;m1 = word[0, 9, 9, A, A, B, B, C] + + pmaddwd m5, m1, [r2 + 7 * 16] + pmaddwd m6, m1, [r2 + 5 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m4, m1, [r2 + 3 * 16] + pmaddwd m6, m1, [r2 + 1 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + 
packuswb m5, m4 + mova [r0 + 144], m5 + +; mode 12 + + pmaddwd m3, m1 + pmaddwd m6, m1, [r2 - 1 * 16] + + packssdw m3, m6 + paddw m3, m2 + psraw m3, 5 + + pmaddwd m4, m1, [r2 - 6 * 16] + pmaddwd m6, m1, [r3 + 5 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + packuswb m3, m4 + mova [r0 + 160], m3 + +; mode 13 + + mova m3, m1 + movd m7, [r1 + 4] + punpcklwd m7, m1 + pand m7, [pb_0000000000000F0F] + pslldq m3, 4 + por m3, m7 ;m3 = word[4, 0, 0, 9, 9, A, A, B] + + pmaddwd m5, m1, [r2 + 0 * 16] + pmaddwd m6, m1, [r3 + 7 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m4, m1, m0 + pmaddwd m6, m3, [r2 + 5 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + packuswb m5, m4 + mova [r0 + 176], m5 + +; mode 14 + + pmaddwd m5, m1, [r2 - 4 * 16] + pmaddwd m6, m1, [r3 - 1 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + movd m6, [r1 + 2] + pand m3, [pw_FFFFFFFFFFFFFFF0] + pand m6, [pb_000000000000000F] + por m3, m6 ;m3 = word[2, 0, 0, 9, 9, A, A, B] + + pmaddwd m4, m3, [r2 + 2 * 16] + pmaddwd m6, m3, [r3 + 5 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + packuswb m5, m4 + mova [r0 + 192], m5 + psrldq m5, 4 + movd [r0 + 240], m5 ;mode 17 row 0 + +; mode 15 + + pmaddwd m5, m1, [r3 + 8 * 16] + pmaddwd m6, m3, [r2 + 7 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m6, m3, [r3 + 6 * 16] + + mova m0, m3 + punpcklwd m7, m3 + pslldq m0, 4 + pand m7, [pb_0000000000000F0F] + por m0, m7 ;m0 = word[4, 2, 2, 0, 0, 9, 9, A] + + pmaddwd m4, m0, [r2 + 5 * 16] + + packssdw m6, m4 + paddw m6, m2 + psraw m6, 5 + + packuswb m5, m6 + mova [r0 + 208], m5 + +; mode 16 + + pmaddwd m5, m1, [r3 + 4 * 16] + pmaddwd m6, m3, [r2 - 1 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m3, [r3 - 6 * 16] + + movd m6, [r1 + 3] + pand m0, [pw_FFFFFFFFFFFFFFF0] + pand m6, [pb_000000000000000F] + por m0, m6 ;m0 = word[3, 2, 2, 0, 0, 9, 9, A] + + pmaddwd m0, [r3 + 5 * 16] + packssdw m3, m0 + paddw m3, m2 + psraw m3, 5 + + packuswb m5, m3 + mova [r0 + 224], m5 + +; mode 17 + + movd m4, [r1 + 1] + punpcklwd m4, m1 + pand m4, [pb_0000000000000F0F] + pslldq m1, 4 + por m1, m4 ;m1 = word[1, 0, 0, 9, 9, A, A, B] + + pmaddwd m6, m1, [r3 + 5 * 16] + + packssdw m6, m6 + paddw m6, m2 + psraw m6, 5 + + movd m5, [r1 + 2] + punpcklwd m5, m1 + pand m5, [pb_0000000000000F0F] + pslldq m1, 4 + por m1, m5 ;m1 = word[2, 1, 1, 0, 0, 9, 9, A] + + pmaddwd m4, m1, [r2 - 5 * 16] + + punpcklwd m7, m1 + pand m7, [pb_0000000000000F0F] + pslldq m1, 4 + por m1, m7 ;m1 = word[4, 2, 2, 1, 1, 0, 0, 9] + + pmaddwd m1, [r2 + 1 * 16] + packssdw m4, m1 + paddw m4, m2 + psraw m4, 5 + + packuswb m6, m4 + movd [r0 + 244], m6 + psrldq m6, 8 + movh [r0 + 248], m6 + +; mode 18 + + movh m1, [r1] + movd [r0 + 256], m1 ;byte[0, 1, 2, 3] + + movh m3, [r1 + 2] + punpcklqdq m3, m1 + psrldq m3, 7 + movd [r0 + 260], m3 ;byte[2, 1, 0, 9] + + movh m4, [r1 + 3] + punpcklqdq m4, m3 + psrldq m4, 7 + movd [r0 + 264], m4 ;byte[1, 0, 9, A] + + movh m0, [r1 + 4] + punpcklqdq m0, m4 + psrldq m0, 7 + movd [r0 + 268], m0 ;byte[0, 9, A, B] + +; mode 19 + + pxor m7, m7 + punpcklbw m4, m3 + punpcklbw m3, m1 + punpcklbw m1, m1 + punpcklbw m4, m7 ;m4 = word[A, 9, 9, 0, 0, 1, 1, 2] + punpcklbw m3, m7 ;m3 = word[9, 0, 0, 1, 1, 2, 2, 3] + psrldq m1, 1 + punpcklbw m1, m7 ;m1 = word[0, 1, 1, 2, 2, 3, 3, 4] + + pmaddwd m6, m1, [r3 - 1 * 16] + pmaddwd m7, m3, [r3 + 5 * 16] + + packssdw m6, m7 + paddw m6, m2 + psraw m6, 5 + + pmaddwd m5, m4, [r2 - 5 * 16] + + movd m7, [r1 + 12] + punpcklwd m7, m4 + pand m7, 
[pb_0000000000000F0F] + pslldq m4, 4 + por m4, m7 ;m4 = word[C, A, A, 9, 9, 0, 0, 1] + + pmaddwd m4, [r2 + 1 * 16] + packssdw m5, m4 + paddw m5, m2 + psraw m5, 5 + + packuswb m6, m5 + mova [r0 + 272], m6 + movd [r0 + 324], m6 ;mode 22 row 1 + +; mode 20 + + pmaddwd m5, m1, [r3 + 4 * 16] + + movd m4, [r1 + 10] + pand m3, [pw_FFFFFFFFFFFFFFF0] + pand m4, [pb_000000000000000F] + por m3, m4 ;m3 = word[A, 0, 0, 1, 1, 2, 2, 3] + + pmaddwd m6, m3, [r2 - 1 * 16] + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + pmaddwd m4, m3, [r3 - 6 * 16] + + punpcklwd m0, m3 + pand m0, [pb_0000000000000F0F] + mova m6, m3 + pslldq m6, 4 + por m0, m6 ;m0 = word[B, A, A, 0, 0, 1, 1, 2] + + pmaddwd m6, m0, [r3 + 5 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + packuswb m5, m4 + mova [r0 + 288], m5 + +; mode 21 + + pmaddwd m4, m1, [r3 + 8 * 16] + pmaddwd m6, m3, [r2 + 7 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m5, m3, [r3 + 6 * 16] + + pand m0, [pw_FFFFFFFFFFFFFFF0] + pand m7, [pb_000000000000000F] + por m0, m7 ;m0 = word[C, A, A, 0, 0, 1, 1, 2] + + pmaddwd m0, [r2 + 5 * 16] + packssdw m5, m0 + paddw m5, m2 + psraw m5, 5 + + packuswb m4, m5 + mova [r0 + 304], m4 + +; mode 22 + + pmaddwd m4, m1, [r2 - 4 * 16] + packssdw m4, m4 + paddw m4, m2 + psraw m4, 5 + + mova m0, [r3 + 5 * 16] + pmaddwd m5, m3, [r2 + 2 * 16] + pmaddwd m6, m3, m0 + + packssdw m5, m6 + paddw m5, m2 + psraw m5, 5 + + packuswb m4, m5 + movd [r0 + 320], m4 + psrldq m4, 8 + movh [r0 + 328], m4 + +; mode 23 + + pmaddwd m4, m1, [r2 + 0 * 16] + pmaddwd m5, m1, [r3 + 7 * 16] + + packssdw m4, m5 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m6, m1, [r3 - 2 * 16] + + pand m3, [pw_FFFFFFFFFFFFFFF0] + por m3, m7 ;m3 = word[C, 0, 0, 1, 1, 2, 2, 3] + + pmaddwd m3, [r2 + 5 * 16] + packssdw m6, m3 + paddw m6, m2 + psraw m6, 5 + + packuswb m4, m6 + mova [r0 + 336], m4 + +; mode 24 + + pmaddwd m4, m1, [r2 + 4 * 16] + pmaddwd m5, m1, [r2 - 1 * 16] + + packssdw m4, m5 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m6, m1, [r2 - 6 * 16] + pmaddwd m0, m1 + + packssdw m6, m0 + paddw m6, m2 + psraw m6, 5 + + packuswb m4, m6 + mova [r0 + 352], m4 + +; mode 25 + + pmaddwd m4, m1, [r2 + 7 * 16] + pmaddwd m5, m1, [r2 + 5 * 16] + + packssdw m4, m5 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m6, m1, [r2 + 3 * 16] + pmaddwd m1, [r2 + 1 * 16] + + packssdw m6, m1 + paddw m6, m2 + psraw m6, 5 + + packuswb m4, m6 + mova [r0 + 368], m4 + +; mode 27 + + movh m0, [r1 + 1] + pxor m7, m7 + punpcklbw m0, m0 + psrldq m0, 1 + movh m1, m0 + psrldq m0, 2 + movh m3, m0 + psrldq m0, 2 + punpcklbw m1, m7 ;m1 = word[1, 2, 2, 3, 3, 4, 4, 5] + punpcklbw m3, m7 ;m3 = word[2, 3, 3, 4, 4, 5, 5, 6] + punpcklbw m0, m7 ;m0 = word[3, 4, 4, 5, 5, 6, 6, 7] + + mova m7, [r3 - 3 * 16] + + pmaddwd m4, m1, [r3 - 5 * 16] + pmaddwd m5, m1, m7 + + packssdw m4, m5 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m6, m1, [r3 - 1 * 16] + pmaddwd m5, m1, [r3 + 1 * 16] + + packssdw m6, m5 + paddw m6, m2 + psraw m6, 5 + + packuswb m4, m6 + mova [r0 + 400], m4 + +; mode 28 + + pmaddwd m4, m1, [r3 - 2 * 16] + pmaddwd m5, m1, [r3 + 3 * 16] + + packssdw m4, m5 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m6, m1, [r3 + 8 * 16] + pmaddwd m5, m1, [r2 - 3 * 16] + + packssdw m6, m5 + paddw m6, m2 + psraw m6, 5 + + packuswb m4, m6 + mova [r0 + 416], m4 + +; mode 29 + + pmaddwd m4, m1, [r3 + 2 * 16] + pmaddwd m6, m1, [r2 - 5 * 16] + + packssdw m4, m6 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m6, m1, [r2 + 4 * 16] + pmaddwd m5, m3, m7 + + packssdw m6, m5 + paddw m6, m2 + psraw m6, 5 + + packuswb m4, m6 + mova [r0 + 
432], m4 + +; mode 30 + + pmaddwd m4, m1, [r3 + 6 * 16] + pmaddwd m5, m1, [r2 + 3 * 16] + + packssdw m4, m5 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m6, m3, [r3 + 0 * 16] + pmaddwd m5, m3, [r2 - 3 * 16] + + packssdw m6, m5 + paddw m6, m2 + psraw m6, 5 + + packuswb m4, m6 + mova [r0 + 448], m4 + psrldq m4, 4 + movh [r0 + 496], m4 ;mode 33 row 0 + psrldq m4, 8 + movd [r0 + 500], m4 ;mode 33 row 1 + +; mode 31 + + pmaddwd m4, m1, [r2 - 6 * 16] + pmaddwd m5, m3, [r3 - 5 * 16] + + packssdw m4, m5 + paddw m4, m2 + psraw m4, 5 + + pmaddwd m6, m3, [r2 - 4 * 16] + pmaddwd m7, m0 + + packssdw m6, m7 + paddw m6, m2 + psraw m6, 5 + + packuswb m4, m6 + mova [r0 + 464], m4 + +; mode 32 + + pmaddwd m1, [r2 - 2 * 16] + pmaddwd m5, m3, [r3 + 3 * 16] + + packssdw m1, m5 + paddw m1, m2 + psraw m1, 5 + + pmaddwd m3, [r2 + 8 * 16] + pmaddwd m5, m0, [r2 - 3 * 16] + packssdw m3, m5 + paddw m3, m2 + psraw m3, 5 + + packuswb m1, m3 + mova [r0 + 480], m1 + +; mode 33 + + pmaddwd m0, [r3 + 7 * 16] + pxor m7, m7 + movh m4, [r1 + 4] + punpcklbw m4, m4 + psrldq m4, 1 + punpcklbw m4, m7 + + pmaddwd m4, [r3 + 1 * 16] + + packssdw m0, m4 + paddw m0, m2 + psraw m0, 5 + + packuswb m0, m0 + movh [r0 + 504], m0 + +; mode 34 + + movh m7, [r1 + 2] + movd [r0 + 512], m7 ;byte[2, 3, 4, 5] + + psrldq m7, 1 + movd [r0 + 516], m7 ;byte[3, 4, 5, 6] + + psrldq m7, 1 + movd [r0 + 520], m7 ;byte[4, 5, 6, 7] + + psrldq m7, 1 + movd [r0 + 524], m7 ;byte[5, 6, 7, 8] + +RET
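
Besides the interpolated modes, both all_angs_pred_4x4 implementations special-case modes 10 and 26: the psubw / psraw 1 / paddw / packuswb sequences around those stores are the HEVC boundary smoothing for the pure horizontal and vertical predictors, with packuswb supplying the clip. A sketch of the math only; the names are illustrative, and the assembly stores the below-18 modes transposed, so the actual byte offsets differ:

    #include <stdint.h>

    static inline uint8_t clip255(int v)
    {
        return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }

    /* Bias the first line of the pure horizontal/vertical predictor by
     * half the gradient along the opposite reference edge (4x4 case). */
    static void smoothPureModes(uint8_t filtered10[4], uint8_t filtered26[4],
                                const uint8_t *top, const uint8_t *left,
                                uint8_t topLeft)
    {
        for (int i = 0; i < 4; i++)
        {
            filtered10[i] = clip255(left[0] + ((top[i]  - topLeft) >> 1)); /* mode 10 */
            filtered26[i] = clip255(top[0]  + ((left[i] - topLeft) >> 1)); /* mode 26 */
        }
    }
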
View file
x265_1.6.tar.gz/source/common/x86/ipfilter16.asm -> x265_1.7.tar.gz/source/common/x86/ipfilter16.asm
Changed
@@ -113,10 +113,13 @@ times 8 dw 58, -10 times 8 dw 4, -1 +const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 + SECTION .text cextern pd_32 cextern pw_pixel_max cextern pd_n32768 +cextern pw_2000 ;------------------------------------------------------------------------------------------------------------ ; void interp_8tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -5525,65 +5528,1409 @@ FILTER_VER_LUMA_SS 64, 16 FILTER_VER_LUMA_SS 16, 64 -;-------------------------------------------------------------------------------------------------- -; void filterConvertPelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height) -;-------------------------------------------------------------------------------------------------- -INIT_XMM sse2 -cglobal luma_p2s, 3, 7, 5 +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_2xN 1 +INIT_XMM sse4 +cglobal filterPixelToShort_2x%1, 3, 6, 2 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] - add r1, r1 + ; load constant + mova m1, [pw_2000] - ; load width and height - mov r3d, r3m - mov r4d, r4m +%rep %1/4 + movd m0, [r0] + movhps m0, [r0 + r1] + psllw m0, 4 + psubw m0, m1 + + movd [r2 + r3 * 0], m0 + pextrd [r2 + r3 * 1], m0, 2 + + movd m0, [r0 + r1 * 2] + movhps m0, [r0 + r4] + psllw m0, 4 + psubw m0, m1 + + movd [r2 + r3 * 2], m0 + pextrd [r2 + r5], m0, 2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + RET +%endmacro +P2S_H_2xN 4 +P2S_H_2xN 8 +P2S_H_2xN 16 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_4xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_4x%1, 3, 6, 2 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] ; load constant - mova m4, [tab_c_n8192] + mova m1, [pw_2000] -.loopH: +%rep %1/4 + movh m0, [r0] + movhps m0, [r0 + r1] + psllw m0, 4 + psubw m0, m1 + movh [r2 + r3 * 0], m0 + movhps [r2 + r3 * 1], m0 + + movh m0, [r0 + r1 * 2] + movhps m0, [r0 + r5] + psllw m0, 4 + psubw m0, m1 + movh [r2 + r3 * 2], m0 + movhps [r2 + r4], m0 - xor r5d, r5d -.loopW: - lea r6, [r0 + r5 * 2] + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + RET +%endmacro +P2S_H_4xN 4 +P2S_H_4xN 8 +P2S_H_4xN 16 +P2S_H_4xN 32 - movu m0, [r6] - psllw m0, 4 - paddw m0, m4 +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +INIT_XMM ssse3 +cglobal filterPixelToShort_4x2, 3, 4, 1 + add r1d, r1d + mov r3d, r3m + add r3d, r3d - movu m1, [r6 + r1] - psllw m1, 4 - paddw m1, m4 + movh m0, [r0] + movhps m0, [r0 + r1] + psllw m0, 4 + psubw m0, [pw_2000] + movh [r2 + r3 * 0], m0 + movhps [r2 + r3 * 1], m0 - movu m2, [r6 + r1 * 2] - psllw m2, 4 - paddw m2, m4 - - lea r6, [r6 + r1 * 2] - movu m3, [r6 + r1] - psllw m3, 4 - paddw m3, m4 + RET - add r5, 8 - cmp r5, r3 - jg .width4 - movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0 - movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1 - movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 
16], m2 - movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3 - je .nextH - jmp .loopW +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_6xN 1 +INIT_XMM sse4 +cglobal filterPixelToShort_6x%1, 3, 7, 3 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] -.width4: - movh [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0 - movh [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1 - movh [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2 - movh [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3 + ; load height + mov r6d, %1/4 -.nextH: - lea r0, [r0 + r1 * 4] - add r2, FENC_STRIDE * 8 + ; load constant + mova m2, [pw_2000] - sub r4d, 4 - jnz .loopH +.loop + movu m0, [r0] + movu m1, [r0 + r1] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movh [r2 + r3 * 0], m0 + pextrd [r2 + r3 * 0 + 8], m0, 2 + movh [r2 + r3 * 1], m1 + pextrd [r2 + r3 * 1 + 8], m1, 2 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r5] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movh [r2 + r3 * 2], m0 + pextrd [r2 + r3 * 2 + 8], m0, 2 + movh [r2 + r4], m1 + pextrd [r2 + r4 + 8], m1, 2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_6xN 8 +P2S_H_6xN 16 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_8xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_8x%1, 3, 7, 2 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m1, [pw_2000] + +.loop + movu m0, [r0] + psllw m0, 4 + psubw m0, m1 + movu [r2 + r3 * 0], m0 + + movu m0, [r0 + r1] + psllw m0, 4 + psubw m0, m1 + movu [r2 + r3 * 1], m0 + + movu m0, [r0 + r1 * 2] + psllw m0, 4 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + movu m0, [r0 + r5] + psllw m0, 4 + psubw m0, m1 + movu [r2 + r4], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_8xN 8 +P2S_H_8xN 4 +P2S_H_8xN 16 +P2S_H_8xN 32 +P2S_H_8xN 12 +P2S_H_8xN 64 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +INIT_XMM ssse3 +cglobal filterPixelToShort_8x2, 3, 4, 2 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + + movu m0, [r0] + movu m1, [r0 + r1] + + psllw m0, 4 + psubw m0, [pw_2000] + psllw m1, 4 + psubw m1, [pw_2000] + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + + RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +INIT_XMM ssse3 +cglobal filterPixelToShort_8x6, 3, 7, 4 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r1 * 5] + lea r6, [r3 * 3] + + ; load constant + mova m3, [pw_2000] + + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + r1 * 2] + + psllw m0, 4 + psubw m0, m3 + psllw m1, 4 + psubw m1, m3 + psllw m2, 4 + psubw m2, m3 + + movu [r2 + r3 * 0], m0 + 
movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + + movu m0, [r0 + r4] + movu m1, [r0 + r1 * 4] + movu m2, [r0 + r5 ] + + psllw m0, 4 + psubw m0, m3 + psllw m1, 4 + psubw m1, m3 + psllw m2, 4 + psubw m2, m3 + + movu [r2 + r6], m0 + movu [r2 + r3 * 4], m1 + lea r2, [r2 + r3 * 4] + movu [r2 + r3], m2 + + RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_16xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_16x%1, 3, 7, 3 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m2, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r5] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2], m0 + movu [r2 + r4], m1 + + movu m0, [r0 + 16] + movu m1, [r0 + r1 + 16] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 1 + 16], m1 + + movu m0, [r0 + r1 * 2 + 16] + movu m1, [r0 + r5 + 16] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2 + 16], m0 + movu [r2 + r4 + 16], m1 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_16xN 16 +P2S_H_16xN 4 +P2S_H_16xN 8 +P2S_H_16xN 12 +P2S_H_16xN 32 +P2S_H_16xN 64 +P2S_H_16xN 24 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_16xN_avx2 1 +INIT_YMM avx2 +cglobal filterPixelToShort_16x%1, 3, 7, 3 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m2, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r5] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2], m0 + movu [r2 + r4], m1 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_16xN_avx2 16 +P2S_H_16xN_avx2 4 +P2S_H_16xN_avx2 8 +P2S_H_16xN_avx2 12 +P2S_H_16xN_avx2 32 +P2S_H_16xN_avx2 64 +P2S_H_16xN_avx2 24 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_32xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_32x%1, 3, 7, 5 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m4, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + r1 * 2] + movu m3, [r0 + r5] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + + movu m0, [r0 + 16] + movu m1, [r0 + r1 + 16] + movu m2, [r0 + r1 * 
2 + 16] + movu m3, [r0 + r5 + 16] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 1 + 16], m1 + movu [r2 + r3 * 2 + 16], m2 + movu [r2 + r4 + 16], m3 + + movu m0, [r0 + 32] + movu m1, [r0 + r1 + 32] + movu m2, [r0 + r1 * 2 + 32] + movu m3, [r0 + r5 + 32] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 32], m0 + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 2 + 32], m2 + movu [r2 + r4 + 32], m3 + + movu m0, [r0 + 48] + movu m1, [r0 + r1 + 48] + movu m2, [r0 + r1 * 2 + 48] + movu m3, [r0 + r5 + 48] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 48], m0 + movu [r2 + r3 * 1 + 48], m1 + movu [r2 + r3 * 2 + 48], m2 + movu [r2 + r4 + 48], m3 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_32xN 32 +P2S_H_32xN 8 +P2S_H_32xN 16 +P2S_H_32xN 24 +P2S_H_32xN 64 +P2S_H_32xN 48 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_32xN_avx2 1 +INIT_YMM avx2 +cglobal filterPixelToShort_32x%1, 3, 7, 3 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m2, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r5] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2], m0 + movu [r2 + r4], m1 + + movu m0, [r0 + 32] + movu m1, [r0 + r1 + 32] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0 + 32], m0 + movu [r2 + r3 * 1 + 32], m1 + + movu m0, [r0 + r1 * 2 + 32] + movu m1, [r0 + r5 + 32] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2 + 32], m0 + movu [r2 + r4 + 32], m1 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_32xN_avx2 32 +P2S_H_32xN_avx2 8 +P2S_H_32xN_avx2 16 +P2S_H_32xN_avx2 24 +P2S_H_32xN_avx2 64 +P2S_H_32xN_avx2 48 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_64xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_64x%1, 3, 7, 5 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m4, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + r1 * 2] + movu m3, [r0 + r5] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + + movu m0, [r0 + 16] + movu m1, [r0 + r1 + 16] + movu m2, [r0 + r1 * 2 + 16] + movu m3, [r0 + r5 + 16] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 
1 + 16], m1 + movu [r2 + r3 * 2 + 16], m2 + movu [r2 + r4 + 16], m3 + + movu m0, [r0 + 32] + movu m1, [r0 + r1 + 32] + movu m2, [r0 + r1 * 2 + 32] + movu m3, [r0 + r5 + 32] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 32], m0 + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 2 + 32], m2 + movu [r2 + r4 + 32], m3 + + movu m0, [r0 + 48] + movu m1, [r0 + r1 + 48] + movu m2, [r0 + r1 * 2 + 48] + movu m3, [r0 + r5 + 48] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 48], m0 + movu [r2 + r3 * 1 + 48], m1 + movu [r2 + r3 * 2 + 48], m2 + movu [r2 + r4 + 48], m3 + + movu m0, [r0 + 64] + movu m1, [r0 + r1 + 64] + movu m2, [r0 + r1 * 2 + 64] + movu m3, [r0 + r5 + 64] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 64], m0 + movu [r2 + r3 * 1 + 64], m1 + movu [r2 + r3 * 2 + 64], m2 + movu [r2 + r4 + 64], m3 + + movu m0, [r0 + 80] + movu m1, [r0 + r1 + 80] + movu m2, [r0 + r1 * 2 + 80] + movu m3, [r0 + r5 + 80] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 80], m0 + movu [r2 + r3 * 1 + 80], m1 + movu [r2 + r3 * 2 + 80], m2 + movu [r2 + r4 + 80], m3 + + movu m0, [r0 + 96] + movu m1, [r0 + r1 + 96] + movu m2, [r0 + r1 * 2 + 96] + movu m3, [r0 + r5 + 96] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 96], m0 + movu [r2 + r3 * 1 + 96], m1 + movu [r2 + r3 * 2 + 96], m2 + movu [r2 + r4 + 96], m3 + + movu m0, [r0 + 112] + movu m1, [r0 + r1 + 112] + movu m2, [r0 + r1 * 2 + 112] + movu m3, [r0 + r5 + 112] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 112], m0 + movu [r2 + r3 * 1 + 112], m1 + movu [r2 + r3 * 2 + 112], m2 + movu [r2 + r4 + 112], m3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_64xN 64 +P2S_H_64xN 16 +P2S_H_64xN 32 +P2S_H_64xN 48 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_64xN_avx2 1 +INIT_YMM avx2 +cglobal filterPixelToShort_64x%1, 3, 7, 3 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m2, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r5] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2], m0 + movu [r2 + r4], m1 + + movu m0, [r0 + 32] + movu m1, [r0 + r1 + 32] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0 + 32], m0 + movu [r2 + r3 * 1 + 32], m1 + + movu m0, [r0 + r1 * 2 + 32] + movu m1, [r0 + r5 + 32] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2 + 32], m0 + movu [r2 + r4 + 32], m1 + + movu m0, [r0 + 64] + movu m1, [r0 + r1 + 64] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0 + 64], m0 + 
movu [r2 + r3 * 1 + 64], m1 + + movu m0, [r0 + r1 * 2 + 64] + movu m1, [r0 + r5 + 64] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2 + 64], m0 + movu [r2 + r4 + 64], m1 + + movu m0, [r0 + 96] + movu m1, [r0 + r1 + 96] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0 + 96], m0 + movu [r2 + r3 * 1 + 96], m1 + + movu m0, [r0 + r1 * 2 + 96] + movu m1, [r0 + r5 + 96] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2 + 96], m0 + movu [r2 + r4 + 96], m1 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_64xN_avx2 64 +P2S_H_64xN_avx2 16 +P2S_H_64xN_avx2 32 +P2S_H_64xN_avx2 48 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_24xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_24x%1, 3, 7, 5 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m4, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + r1 * 2] + movu m3, [r0 + r5] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + + movu m0, [r0 + 16] + movu m1, [r0 + r1 + 16] + movu m2, [r0 + r1 * 2 + 16] + movu m3, [r0 + r5 + 16] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 1 + 16], m1 + movu [r2 + r3 * 2 + 16], m2 + movu [r2 + r4 + 16], m3 + + movu m0, [r0 + 32] + movu m1, [r0 + r1 + 32] + movu m2, [r0 + r1 * 2 + 32] + movu m3, [r0 + r5 + 32] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 32], m0 + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 2 + 32], m2 + movu [r2 + r4 + 32], m3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_24xN 32 +P2S_H_24xN 64 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_24xN_avx2 1 +INIT_YMM avx2 +cglobal filterPixelToShort_24x%1, 3, 7, 3 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m2, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + 32] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 0 + 32], xm1 + + movu m0, [r0 + r1] + movu m1, [r0 + r1 + 32] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + movu [r2 + r3 * 1], m0 + movu [r2 + r3 * 1 + 32], xm1 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r1 * 2 + 32] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + movu [r2 + r3 * 2], m0 + movu [r2 + r3 * 2 + 32], xm1 + + movu m0, [r0 + r5] + movu m1, [r0 + r5 + 32] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + movu [r2 + r4], m0 + movu [r2 + r4 + 32], xm1 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro 
+P2S_H_24xN_avx2 32 +P2S_H_24xN_avx2 64 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_12xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_12x%1, 3, 7, 3 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, %1/4 + + ; load constant + mova m2, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r5] + psllw m0, 4 + psubw m0, m2 + psllw m1, 4 + psubw m1, m2 + + movu [r2 + r3 * 2], m0 + movu [r2 + r4], m1 + + movh m0, [r0 + 16] + movhps m0, [r0 + r1 + 16] + psllw m0, 4 + psubw m0, m2 + + movh [r2 + r3 * 0 + 16], m0 + movhps [r2 + r3 * 1 + 16], m0 + + movh m0, [r0 + r1 * 2 + 16] + movhps m0, [r0 + r5 + 16] + psllw m0, 4 + psubw m0, m2 + + movh [r2 + r3 * 2 + 16], m0 + movhps [r2 + r4 + 16], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_12xN 16 +P2S_H_12xN 32 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +INIT_XMM ssse3 +cglobal filterPixelToShort_48x64, 3, 7, 5 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, 16 + + ; load constant + mova m4, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + r1 * 2] + movu m3, [r0 + r5] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + + movu m0, [r0 + 16] + movu m1, [r0 + r1 + 16] + movu m2, [r0 + r1 * 2 + 16] + movu m3, [r0 + r5 + 16] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 1 + 16], m1 + movu [r2 + r3 * 2 + 16], m2 + movu [r2 + r4 + 16], m3 + + movu m0, [r0 + 32] + movu m1, [r0 + r1 + 32] + movu m2, [r0 + r1 * 2 + 32] + movu m3, [r0 + r5 + 32] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 32], m0 + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 2 + 32], m2 + movu [r2 + r4 + 32], m3 + + movu m0, [r0 + 48] + movu m1, [r0 + r1 + 48] + movu m2, [r0 + r1 * 2 + 48] + movu m3, [r0 + r5 + 48] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 48], m0 + movu [r2 + r3 * 1 + 48], m1 + movu [r2 + r3 * 2 + 48], m2 + movu [r2 + r4 + 48], m3 + + movu m0, [r0 + 64] + movu m1, [r0 + r1 + 64] + movu m2, [r0 + r1 * 2 + 64] + movu m3, [r0 + r5 + 64] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 64], m0 + movu [r2 + r3 * 1 + 64], m1 + movu [r2 + r3 * 2 + 64], m2 + movu [r2 + r4 + 64], m3 + + movu m0, [r0 + 80] + movu m1, [r0 + r1 + 80] + movu m2, [r0 + r1 * 2 + 80] + movu m3, [r0 + r5 + 80] + psllw m0, 4 + psubw m0, m4 + psllw m1, 4 + psubw m1, m4 + 
psllw m2, 4 + psubw m2, m4 + psllw m3, 4 + psubw m3, m4 + + movu [r2 + r3 * 0 + 80], m0 + movu [r2 + r3 * 1 + 80], m1 + movu [r2 + r3 * 2 + 80], m2 + movu [r2 + r4 + 80], m3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal filterPixelToShort_48x64, 3, 7, 4 + add r1d, r1d + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load height + mov r6d, 16 + + ; load constant + mova m3, [pw_2000] + +.loop + movu m0, [r0] + movu m1, [r0 + 32] + movu m2, [r0 + 64] + psllw m0, 4 + psubw m0, m3 + psllw m1, 4 + psubw m1, m3 + psllw m2, 4 + psubw m2, m3 + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 0 + 32], m1 + movu [r2 + r3 * 0 + 64], m2 + + movu m0, [r0 + r1] + movu m1, [r0 + r1 + 32] + movu m2, [r0 + r1 + 64] + psllw m0, 4 + psubw m0, m3 + psllw m1, 4 + psubw m1, m3 + psllw m2, 4 + psubw m2, m3 + movu [r2 + r3 * 1], m0 + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 1 + 64], m2 + + movu m0, [r0 + r1 * 2] + movu m1, [r0 + r1 * 2 + 32] + movu m2, [r0 + r1 * 2 + 64] + psllw m0, 4 + psubw m0, m3 + psllw m1, 4 + psubw m1, m3 + psllw m2, 4 + psubw m2, m3 + movu [r2 + r3 * 2], m0 + movu [r2 + r3 * 2 + 32], m1 + movu [r2 + r3 * 2 + 64], m2 + + movu m0, [r0 + r5] + movu m1, [r0 + r5 + 32] + movu m2, [r0 + r5 + 64] + psllw m0, 4 + psubw m0, m3 + psllw m1, 4 + psubw m1, m3 + psllw m2, 4 + psubw m2, m3 + movu [r2 + r4], m0 + movu [r2 + r4 + 32], m1 + movu [r2 + r4 + 64], m2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET + + +;----------------------------------------------------------------------------------------------------------------------------- +;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- + +%macro IPFILTER_LUMA_PS_4xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_4x%1, 6,8,7 + mov r5d, r5m + mov r4d, r4m + add r1d, r1d + add r3d, r3d +%ifdef PIC + + lea r6, [tab_LumaCoeff] + lea r4 , [r4 * 8] + vbroadcasti128 m0, [r6 + r4 * 2] + +%else + lea r4 , [r4 * 8] + vbroadcasti128 m0, [tab_LumaCoeff + r4 * 2] +%endif + + vbroadcasti128 m2, [pd_n32768] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - pw_2000 + + sub r0, 6 + test r5d, r5d + mov r7d, %1 ; loop count variable - height + jz .preloop + lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride + sub r0, r6 ; r0(src) - 3 * srcStride + add r7d, 6 ;7 - 1(since last row not in loop) ; need extra 7 rows, just set a specially flag here, blkheight += N - 1 (7 - 3 = 4 ; since the last three rows not in loop) + +.preloop: + lea r6, [r3 * 3] +.loop + ; Row 0 + movu xm3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + movu xm4, [r0 + 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + vinserti128 m3, m3, xm4, 1 + movu xm4, [r0 + 4] + movu xm5, [r0 + 6] + vinserti128 m4, m4, xm5, 1 + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + ; Row 1 + movu xm4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + movu xm5, [r0 + r1 + 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + vinserti128 m4, m4, xm5, 1 + movu xm5, [r0 + r1 + 4] + movu xm6, [r0 + 
r1 + 6] + vinserti128 m5, m5, xm6, 1 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + phaddd m3, m4 ; all rows and col completed. + + mova m5, [interp8_hps_shuf] + vpermd m3, m5, m3 + paddd m3, m2 + vextracti128 xm4, m3, 1 + psrad xm3, 2 + psrad xm4, 2 + packssdw xm3, xm3 + packssdw xm4, xm4 + + movq [r2], xm3 ;row 0 + movq [r2 + r3], xm4 ;row 1 + lea r0, [r0 + r1 * 2] ; first loop src ->5th row(i.e 4) + lea r2, [r2 + r3 * 2] ; first loop dst ->5th row(i.e 4) + + sub r7d, 2 + jg .loop + test r5d, r5d + jz .end + + ; Row 10 + movu xm3, [r0] + movu xm4, [r0 + 2] + vinserti128 m3, m3, xm4, 1 + movu xm4, [r0 + 4] + movu xm5, [r0 + 6] + vinserti128 m4, m4, xm5, 1 + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + + ; Row11 + phaddd m3, m4 ; all rows and col completed. + + mova m5, [interp8_hps_shuf] + vpermd m3, m5, m3 + paddd m3, m2 + vextracti128 xm4, m3, 1 + psrad xm3, 2 + psrad xm4, 2 + packssdw xm3, xm3 + packssdw xm4, xm4 + + movq [r2], xm3 ;row 0 +.end + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_4xN_AVX2 4 + IPFILTER_LUMA_PS_4xN_AVX2 8 + IPFILTER_LUMA_PS_4xN_AVX2 16
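Every filterPixelToShort_WxH kernel added above performs the same per-pixel conversion, widening pixels to x265's 14-bit internal precision and re-centering them; the block-size variants differ only in load/store tiling. A minimal C sketch of that scalar operation (a free-standing signature invented for illustration; width/height stand in for the per-size macro arguments, and shift == 4 assumes 10-bit input, matching the "psllw m0, 4 / psubw m0, [pw_2000]" pair used throughout):

    #include <stdint.h>

    /* dst[x] = (src[x] << 4) - 0x2000, i.e. psllw by 4 then psubw pw_2000 */
    static void filterPixelToShort_ref(const uint16_t* src, intptr_t srcStride,
                                       int16_t* dst, intptr_t dstStride,
                                       int width, int height)
    {
        const int shift  = 4;       /* internal precision (14) - bit depth (10) */
        const int offset = 0x2000;  /* pw_2000 == 8192 == 1 << 13 */
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
                dst[x] = (int16_t)((src[x] << shift) - offset);
            src += srcStride;       /* strides in element units; the asm doubles
                                       them to bytes with "add r1d, r1d" */
            dst += dstStride;
        }
    }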
x265_1.6.tar.gz/source/common/x86/ipfilter8.asm -> x265_1.7.tar.gz/source/common/x86/ipfilter8.asm
Changed
@@ -27,269 +27,269 @@ %include "x86util.asm" SECTION_RODATA 32 -tab_Tm: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 - db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 - db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14 +const tab_Tm, db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 + db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 + db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14 -ALIGN 32 const interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15 -ALIGN 32 const interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9 times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13 -ALIGN 32 const interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4 dd 2, 3, 3, 4, 4, 5, 5, 6 -ALIGN 32 const pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10 times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12 times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14 -ALIGN 32 -tab_Lm: db 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8 - db 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10 - db 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12 - db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14 - -tab_Vm: db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1 - db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3 - -tab_Cm: db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3 - -tab_c_526336: times 4 dd 8192*64+2048 - -pd_526336: times 8 dd 8192*64+2048 - -tab_ChromaCoeff: db 0, 64, 0, 0 - db -2, 58, 10, -2 - db -4, 54, 16, -2 - db -6, 46, 28, -4 - db -4, 36, 36, -4 - db -4, 28, 46, -6 - db -2, 16, 54, -4 - db -2, 10, 58, -2 -ALIGN 32 -tab_ChromaCoeff_V: times 8 db 0, 64 - times 8 db 0, 0 +const tab_Lm, db 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8 + db 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10 + db 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12 + db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14 - times 8 db -2, 58 - times 8 db 10, -2 +const tab_Vm, db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1 + db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3 - times 8 db -4, 54 - times 8 db 16, -2 +const tab_Cm, db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3 - times 8 db -6, 46 - times 8 db 28, -4 +const pd_526336, times 8 dd 8192*64+2048 - times 8 db -4, 36 - times 8 db 36, -4 +const tab_ChromaCoeff, db 0, 64, 0, 0 + db -2, 58, 10, -2 + db -4, 54, 16, -2 + db -6, 46, 28, -4 + db -4, 36, 36, -4 + db -4, 28, 46, -6 + db -2, 16, 54, -4 + db -2, 10, 58, -2 - times 8 db -4, 28 - times 8 db 46, -6 +const tabw_ChromaCoeff, dw 0, 64, 0, 0 + dw -2, 58, 10, -2 + dw -4, 54, 16, -2 + dw -6, 46, 28, -4 + dw -4, 36, 36, -4 + dw -4, 28, 46, -6 + dw -2, 16, 54, -4 + dw -2, 10, 58, -2 - times 8 db -2, 16 - times 8 db 54, -4 +const tab_ChromaCoeff_V, times 8 db 0, 64 + times 8 db 0, 0 - times 8 db -2, 10 - times 8 db 58, -2 + times 8 db -2, 58 + times 8 db 10, -2 -tab_ChromaCoeffV: times 4 dw 0, 64 - times 4 dw 0, 0 + times 8 db -4, 54 + times 8 db 16, -2 - times 4 dw -2, 58 - times 4 dw 10, -2 + times 8 db -6, 46 + times 8 db 28, -4 - times 4 dw -4, 54 - times 4 dw 16, -2 + times 8 db -4, 36 + times 8 db 36, -4 - times 4 dw -6, 46 - times 4 dw 28, -4 + times 8 db -4, 28 + times 8 db 46, -6 - times 4 dw -4, 36 - times 4 dw 36, -4 + times 8 db -2, 16 + times 8 db 54, -4 - times 4 dw -4, 28 - times 4 dw 46, -6 + times 8 db -2, 10 + times 8 db 58, -2 - times 4 dw -2, 16 - times 4 dw 54, -4 +const tab_ChromaCoeffV, times 4 dw 0, 64 + times 4 dw 0, 0 - times 4 dw -2, 10 - 
times 4 dw 58, -2 + times 4 dw -2, 58 + times 4 dw 10, -2 -ALIGN 32 -pw_ChromaCoeffV: times 8 dw 0, 64 - times 8 dw 0, 0 + times 4 dw -4, 54 + times 4 dw 16, -2 - times 8 dw -2, 58 - times 8 dw 10, -2 + times 4 dw -6, 46 + times 4 dw 28, -4 - times 8 dw -4, 54 - times 8 dw 16, -2 + times 4 dw -4, 36 + times 4 dw 36, -4 - times 8 dw -6, 46 - times 8 dw 28, -4 - - times 8 dw -4, 36 - times 8 dw 36, -4 - - times 8 dw -4, 28 - times 8 dw 46, -6 - - times 8 dw -2, 16 - times 8 dw 54, -4 - - times 8 dw -2, 10 - times 8 dw 58, -2 - -tab_LumaCoeff: db 0, 0, 0, 64, 0, 0, 0, 0 - db -1, 4, -10, 58, 17, -5, 1, 0 - db -1, 4, -11, 40, 40, -11, 4, -1 - db 0, 1, -5, 17, 58, -10, 4, -1 - -tab_LumaCoeffV: times 4 dw 0, 0 - times 4 dw 0, 64 - times 4 dw 0, 0 - times 4 dw 0, 0 - - times 4 dw -1, 4 - times 4 dw -10, 58 - times 4 dw 17, -5 - times 4 dw 1, 0 - - times 4 dw -1, 4 - times 4 dw -11, 40 - times 4 dw 40, -11 - times 4 dw 4, -1 - - times 4 dw 0, 1 - times 4 dw -5, 17 - times 4 dw 58, -10 - times 4 dw 4, -1 + times 4 dw -4, 28 + times 4 dw 46, -6 -ALIGN 32 -pw_LumaCoeffVer: times 8 dw 0, 0 - times 8 dw 0, 64 - times 8 dw 0, 0 - times 8 dw 0, 0 - - times 8 dw -1, 4 - times 8 dw -10, 58 - times 8 dw 17, -5 - times 8 dw 1, 0 - - times 8 dw -1, 4 - times 8 dw -11, 40 - times 8 dw 40, -11 - times 8 dw 4, -1 - - times 8 dw 0, 1 - times 8 dw -5, 17 - times 8 dw 58, -10 - times 8 dw 4, -1 - -pb_LumaCoeffVer: times 16 db 0, 0 - times 16 db 0, 64 - times 16 db 0, 0 - times 16 db 0, 0 - - times 16 db -1, 4 - times 16 db -10, 58 - times 16 db 17, -5 - times 16 db 1, 0 - - times 16 db -1, 4 - times 16 db -11, 40 - times 16 db 40, -11 - times 16 db 4, -1 - - times 16 db 0, 1 - times 16 db -5, 17 - times 16 db 58, -10 - times 16 db 4, -1 - -tab_LumaCoeffVer: times 8 db 0, 0 - times 8 db 0, 64 - times 8 db 0, 0 - times 8 db 0, 0 - - times 8 db -1, 4 - times 8 db -10, 58 - times 8 db 17, -5 - times 8 db 1, 0 - - times 8 db -1, 4 - times 8 db -11, 40 - times 8 db 40, -11 - times 8 db 4, -1 - - times 8 db 0, 1 - times 8 db -5, 17 - times 8 db 58, -10 - times 8 db 4, -1 + times 4 dw -2, 16 + times 4 dw 54, -4 -ALIGN 32 -tab_LumaCoeffVer_32: times 16 db 0, 0 - times 16 db 0, 64 - times 16 db 0, 0 - times 16 db 0, 0 - - times 16 db -1, 4 - times 16 db -10, 58 - times 16 db 17, -5 - times 16 db 1, 0 - - times 16 db -1, 4 - times 16 db -11, 40 - times 16 db 40, -11 - times 16 db 4, -1 - - times 16 db 0, 1 - times 16 db -5, 17 - times 16 db 58, -10 - times 16 db 4, -1 + times 4 dw -2, 10 + times 4 dw 58, -2 -ALIGN 32 -tab_ChromaCoeffVer_32: times 16 db 0, 64 - times 16 db 0, 0 +const pw_ChromaCoeffV, times 8 dw 0, 64 + times 8 dw 0, 0 + + times 8 dw -2, 58 + times 8 dw 10, -2 + + times 8 dw -4, 54 + times 8 dw 16, -2 + + times 8 dw -6, 46 + times 8 dw 28, -4 + + times 8 dw -4, 36 + times 8 dw 36, -4 + + times 8 dw -4, 28 + times 8 dw 46, -6 + + times 8 dw -2, 16 + times 8 dw 54, -4 + + times 8 dw -2, 10 + times 8 dw 58, -2 + +const tab_LumaCoeff, db 0, 0, 0, 64, 0, 0, 0, 0 + db -1, 4, -10, 58, 17, -5, 1, 0 + db -1, 4, -11, 40, 40, -11, 4, -1 + db 0, 1, -5, 17, 58, -10, 4, -1 + +const tabw_LumaCoeff, dw 0, 0, 0, 64, 0, 0, 0, 0 + dw -1, 4, -10, 58, 17, -5, 1, 0 + dw -1, 4, -11, 40, 40, -11, 4, -1 + dw 0, 1, -5, 17, 58, -10, 4, -1 + +const tab_LumaCoeffV, times 4 dw 0, 0 + times 4 dw 0, 64 + times 4 dw 0, 0 + times 4 dw 0, 0 + + times 4 dw -1, 4 + times 4 dw -10, 58 + times 4 dw 17, -5 + times 4 dw 1, 0 + + times 4 dw -1, 4 + times 4 dw -11, 40 + times 4 dw 40, -11 + times 4 dw 4, -1 + + times 4 dw 0, 1 + times 4 dw -5, 17 + 
times 4 dw 58, -10 + times 4 dw 4, -1 + +const pw_LumaCoeffVer, times 8 dw 0, 0 + times 8 dw 0, 64 + times 8 dw 0, 0 + times 8 dw 0, 0 + + times 8 dw -1, 4 + times 8 dw -10, 58 + times 8 dw 17, -5 + times 8 dw 1, 0 - times 16 db -2, 58 - times 16 db 10, -2 + times 8 dw -1, 4 + times 8 dw -11, 40 + times 8 dw 40, -11 + times 8 dw 4, -1 - times 16 db -4, 54 - times 16 db 16, -2 + times 8 dw 0, 1 + times 8 dw -5, 17 + times 8 dw 58, -10 + times 8 dw 4, -1 - times 16 db -6, 46 - times 16 db 28, -4 +const pb_LumaCoeffVer, times 16 db 0, 0 + times 16 db 0, 64 + times 16 db 0, 0 + times 16 db 0, 0 - times 16 db -4, 36 - times 16 db 36, -4 + times 16 db -1, 4 + times 16 db -10, 58 + times 16 db 17, -5 + times 16 db 1, 0 - times 16 db -4, 28 - times 16 db 46, -6 + times 16 db -1, 4 + times 16 db -11, 40 + times 16 db 40, -11 + times 16 db 4, -1 - times 16 db -2, 16 - times 16 db 54, -4 + times 16 db 0, 1 + times 16 db -5, 17 + times 16 db 58, -10 + times 16 db 4, -1 - times 16 db -2, 10 - times 16 db 58, -2 +const tab_LumaCoeffVer, times 8 db 0, 0 + times 8 db 0, 64 + times 8 db 0, 0 + times 8 db 0, 0 -tab_c_64_n64: times 8 db 64, -64 + times 8 db -1, 4 + times 8 db -10, 58 + times 8 db 17, -5 + times 8 db 1, 0 + + times 8 db -1, 4 + times 8 db -11, 40 + times 8 db 40, -11 + times 8 db 4, -1 + + times 8 db 0, 1 + times 8 db -5, 17 + times 8 db 58, -10 + times 8 db 4, -1 + +const tab_LumaCoeffVer_32, times 16 db 0, 0 + times 16 db 0, 64 + times 16 db 0, 0 + times 16 db 0, 0 + + times 16 db -1, 4 + times 16 db -10, 58 + times 16 db 17, -5 + times 16 db 1, 0 + + times 16 db -1, 4 + times 16 db -11, 40 + times 16 db 40, -11 + times 16 db 4, -1 + + times 16 db 0, 1 + times 16 db -5, 17 + times 16 db 58, -10 + times 16 db 4, -1 + +const tab_ChromaCoeffVer_32, times 16 db 0, 64 + times 16 db 0, 0 + + times 16 db -2, 58 + times 16 db 10, -2 + + times 16 db -4, 54 + times 16 db 16, -2 + + times 16 db -6, 46 + times 16 db 28, -4 + + times 16 db -4, 36 + times 16 db 36, -4 + + times 16 db -4, 28 + times 16 db 46, -6 + + times 16 db -2, 16 + times 16 db 54, -4 + + times 16 db -2, 10 + times 16 db 58, -2 + +const tab_c_64_n64, times 8 db 64, -64 const interp4_shuf, times 2 db 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15 -ALIGN 32 -interp4_horiz_shuf1: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 - db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 +const interp4_horiz_shuf1, db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 + db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 -ALIGN 32 -interp4_hpp_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 +const interp4_hpp_shuf, times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 -ALIGN 32 -interp8_hps_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7 +const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 ALIGN 32 interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 @@ -298,9 +298,276 @@ cextern pb_128 cextern pw_1 +cextern pw_32 cextern pw_512 cextern pw_2000 +%macro FILTER_H4_w2_2_sse2 0 + pxor m3, m3 + movd m0, [srcq - 1] + movd m2, [srcq] + punpckldq m0, m2 + punpcklbw m0, m3 + movd m1, [srcq + srcstrideq - 1] + movd m2, [srcq + srcstrideq] + punpckldq m1, m2 + punpcklbw m1, m3 + pmaddwd m0, m4 + pmaddwd m1, m4 + packssdw m0, m1 + pshuflw m1, m0, q2301 + pshufhw m1, m1, q2301 + paddw m0, m1 + psrld m0, 16 + packssdw m0, m0 + paddw m0, m5 + psraw m0, 6 + packuswb m0, m0 + movd r4, m0 + mov [dstq], r4w + shr r4, 16 + mov [dstq + dststrideq], r4w +%endmacro + 
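A recurring idiom in the new SSE2 macros, beginning with FILTER_H4_w2_2_sse2 above, is the pshuflw/pshufhw q2301 + paddw triple: adjacent 16-bit lanes are swapped and added, emulating SSSE3's phaddw on plain SSE2, so that each 32-bit lane ends up holding the sum of its two words (the psrld 16 that follows then isolates that sum). An intrinsics sketch of the trick (hypothetical helper name):

    #include <emmintrin.h>

    /* After this, both 16-bit halves of every 32-bit lane hold the pair
     * sum; the asm follows up with "psrld 16" to zero-extend it. */
    static inline __m128i pairwise_add_words_sse2(__m128i v)
    {
        __m128i s = _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 3, 0, 1)); /* q2301, low qword  */
        s = _mm_shufflehi_epi16(s, _MM_SHUFFLE(2, 3, 0, 1));         /* q2301, high qword */
        return _mm_add_epi16(v, s);
    }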
+;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_2x4, 4, 6, 6, src, srcstride, dst, dststride + mov r4d, r4m + mova m5, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + + FILTER_H4_w2_2_sse2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] + FILTER_H4_w2_2_sse2 + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6, src, srcstride, dst, dststride + mov r4d, r4m + mova m5, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep 4 + FILTER_H4_w2_2_sse2 +%if x < 4 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endif +%assign x x+1 +%endrep + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6, src, srcstride, dst, dststride + mov r4d, r4m + mova m5, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep 8 + FILTER_H4_w2_2_sse2 +%if x < 8 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endif +%assign x x+1 +%endrep + + RET + +%macro FILTER_H4_w4_2_sse2 0 + pxor m5, m5 + movd m0, [srcq - 1] + movd m6, [srcq] + punpckldq m0, m6 + punpcklbw m0, m5 + movd m1, [srcq + 1] + movd m6, [srcq + 2] + punpckldq m1, m6 + punpcklbw m1, m5 + movd m2, [srcq + srcstrideq - 1] + movd m6, [srcq + srcstrideq] + punpckldq m2, m6 + punpcklbw m2, m5 + movd m3, [srcq + srcstrideq + 1] + movd m6, [srcq + srcstrideq + 2] + punpckldq m3, m6 + punpcklbw m3, m5 + pmaddwd m0, m4 + pmaddwd m1, m4 + pmaddwd m2, m4 + pmaddwd m3, m4 + packssdw m0, m1 + packssdw m2, m3 + pshuflw m1, m0, q2301 + pshufhw m1, m1, q2301 + pshuflw m3, m2, q2301 + pshufhw m3, m3, q2301 + paddw m0, m1 + paddw m2, m3 + psrld m0, 16 + psrld m2, 16 + packssdw m0, m2 + paddw m0, m7 + psraw m0, 6 + packuswb m0, m2 + movd [dstq], m0 + psrldq m0, 4 + movd [dstq + dststrideq], m0 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_4x2, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + + FILTER_H4_w4_2_sse2 + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t 
srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_4x4, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + + FILTER_H4_w4_2_sse2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] + FILTER_H4_w4_2_sse2 + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_4x8, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep 4 + FILTER_H4_w4_2_sse2 +%if x < 4 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endif +%assign x x+1 +%endrep + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_4x16, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep 8 + FILTER_H4_w4_2_sse2 +%if x < 8 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endif +%assign x x+1 +%endrep + + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_4x32, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] +%else + movddup m4, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep 16 + FILTER_H4_w4_2_sse2 +%if x < 16 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] +%endif +%assign x x+1 +%endrep + + RET + %macro FILTER_H4_w2_2 3 movh %2, [srcq - 1] pshufb %2, %2, Tm0 @@ -317,6 +584,1298 @@ mov [dstq + dststrideq], r4w %endmacro +%macro FILTER_H4_w6_sse2 0 + pxor m4, m4 + movh m0, [srcq - 1] + movh m5, [srcq] + punpckldq m0, m5 + movhlps m2, m0 + punpcklbw m0, m4 + punpcklbw m2, m4 + movd m1, [srcq + 1] + movd m5, [srcq + 2] + punpckldq m1, m5 + punpcklbw m1, m4 + pmaddwd m0, m6 + pmaddwd m1, m6 + pmaddwd m2, m6 + packssdw m0, m1 + packssdw m2, m2 + pshuflw m1, m0, q2301 + pshufhw m1, m1, q2301 + pshuflw m3, m2, q2301 + paddw m0, m1 + paddw m2, m3 + psrld m0, 16 + psrld m2, 16 + packssdw m0, m2 + paddw m0, m7 + psraw m0, 6 + packuswb m0, m0 + movd [dstq], m0 + pextrw r4d, m0, 2 + mov [dstq + 4], r4w +%endmacro + +%macro FILH4W8_sse2 1 + movh m0, [srcq - 1 + %1] + movh m5, [srcq + %1] + punpckldq m0, m5 + movhlps m2, m0 + punpcklbw m0, m4 + punpcklbw m2, m4 + movh m1, [srcq + 1 + %1] + movh m5, [srcq + 2 + %1] + punpckldq m1, m5 + movhlps 
m3, m1 + punpcklbw m1, m4 + punpcklbw m3, m4 + pmaddwd m0, m6 + pmaddwd m1, m6 + pmaddwd m2, m6 + pmaddwd m3, m6 + packssdw m0, m1 + packssdw m2, m3 + pshuflw m1, m0, q2301 + pshufhw m1, m1, q2301 + pshuflw m3, m2, q2301 + pshufhw m3, m3, q2301 + paddw m0, m1 + paddw m2, m3 + psrld m0, 16 + psrld m2, 16 + packssdw m0, m2 + paddw m0, m7 + psraw m0, 6 + packuswb m0, m0 + movh [dstq + %1], m0 +%endmacro + +%macro FILTER_H4_w8_sse2 0 + FILH4W8_sse2 0 +%endmacro + +%macro FILTER_H4_w12_sse2 0 + FILH4W8_sse2 0 + movd m1, [srcq - 1 + 8] + movd m3, [srcq + 8] + punpckldq m1, m3 + punpcklbw m1, m4 + movd m2, [srcq + 1 + 8] + movd m3, [srcq + 2 + 8] + punpckldq m2, m3 + punpcklbw m2, m4 + pmaddwd m1, m6 + pmaddwd m2, m6 + packssdw m1, m2 + pshuflw m2, m1, q2301 + pshufhw m2, m2, q2301 + paddw m1, m2 + psrld m1, 16 + packssdw m1, m1 + paddw m1, m7 + psraw m1, 6 + packuswb m1, m1 + movd [dstq + 8], m1 +%endmacro + +%macro FILTER_H4_w16_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 +%endmacro + +%macro FILTER_H4_w24_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 + FILH4W8_sse2 16 +%endmacro + +%macro FILTER_H4_w32_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 + FILH4W8_sse2 16 + FILH4W8_sse2 24 +%endmacro + +%macro FILTER_H4_w48_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 + FILH4W8_sse2 16 + FILH4W8_sse2 24 + FILH4W8_sse2 32 + FILH4W8_sse2 40 +%endmacro + +%macro FILTER_H4_w64_sse2 0 + FILH4W8_sse2 0 + FILH4W8_sse2 8 + FILH4W8_sse2 16 + FILH4W8_sse2 24 + FILH4W8_sse2 32 + FILH4W8_sse2 40 + FILH4W8_sse2 48 + FILH4W8_sse2 56 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_sse3 2 +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + pxor m4, m4 + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m6, [r5 + r4 * 8] +%else + movddup m6, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep %2 + FILTER_H4_w%1_sse2 +%if x < %2 + add srcq, srcstrideq + add dstq, dststrideq +%endif +%assign x x+1 +%endrep + + RET + +%endmacro + + IPFILTER_CHROMA_sse3 6, 8 + IPFILTER_CHROMA_sse3 8, 2 + IPFILTER_CHROMA_sse3 8, 4 + IPFILTER_CHROMA_sse3 8, 6 + IPFILTER_CHROMA_sse3 8, 8 + IPFILTER_CHROMA_sse3 8, 16 + IPFILTER_CHROMA_sse3 8, 32 + IPFILTER_CHROMA_sse3 12, 16 + + IPFILTER_CHROMA_sse3 6, 16 + IPFILTER_CHROMA_sse3 8, 12 + IPFILTER_CHROMA_sse3 8, 64 + IPFILTER_CHROMA_sse3 12, 32 + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_W_sse3 2 +INIT_XMM sse3 +cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] + pxor m4, m4 +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movddup m6, [r5 + r4 * 8] +%else + movddup m6, [tabw_ChromaCoeff + r4 * 8] +%endif + +%assign x 1 +%rep %2 + FILTER_H4_w%1_sse2 +%if x < %2 + add srcq, srcstrideq + add dstq, dststrideq +%endif +%assign x x+1 +%endrep + + RET + +%endmacro + + IPFILTER_CHROMA_W_sse3 16, 4 + IPFILTER_CHROMA_W_sse3 16, 8 + IPFILTER_CHROMA_W_sse3 16, 12 + IPFILTER_CHROMA_W_sse3 16, 16 + IPFILTER_CHROMA_W_sse3 16, 32 + IPFILTER_CHROMA_W_sse3 32, 8 + IPFILTER_CHROMA_W_sse3 
32, 16 + IPFILTER_CHROMA_W_sse3 32, 24 + IPFILTER_CHROMA_W_sse3 24, 32 + IPFILTER_CHROMA_W_sse3 32, 32 + + IPFILTER_CHROMA_W_sse3 16, 24 + IPFILTER_CHROMA_W_sse3 16, 64 + IPFILTER_CHROMA_W_sse3 32, 48 + IPFILTER_CHROMA_W_sse3 24, 64 + IPFILTER_CHROMA_W_sse3 32, 64 + + IPFILTER_CHROMA_W_sse3 64, 64 + IPFILTER_CHROMA_W_sse3 64, 32 + IPFILTER_CHROMA_W_sse3 64, 48 + IPFILTER_CHROMA_W_sse3 48, 64 + IPFILTER_CHROMA_W_sse3 64, 16 + +%macro FILTER_H8_W8_sse2 0 + movh m1, [r0 + x - 3] + movh m4, [r0 + x - 2] + punpcklbw m1, m6 + punpcklbw m4, m6 + movh m5, [r0 + x - 1] + movh m0, [r0 + x] + punpcklbw m5, m6 + punpcklbw m0, m6 + pmaddwd m1, m3 + pmaddwd m4, m3 + pmaddwd m5, m3 + pmaddwd m0, m3 + packssdw m1, m4 + packssdw m5, m0 + pshuflw m4, m1, q2301 + pshufhw m4, m4, q2301 + pshuflw m0, m5, q2301 + pshufhw m0, m0, q2301 + paddw m1, m4 + paddw m5, m0 + psrldq m1, 2 + psrldq m5, 2 + pshufd m1, m1, q3120 + pshufd m5, m5, q3120 + punpcklqdq m1, m5 + movh m7, [r0 + x + 1] + movh m4, [r0 + x + 2] + punpcklbw m7, m6 + punpcklbw m4, m6 + movh m5, [r0 + x + 3] + movh m0, [r0 + x + 4] + punpcklbw m5, m6 + punpcklbw m0, m6 + pmaddwd m7, m3 + pmaddwd m4, m3 + pmaddwd m5, m3 + pmaddwd m0, m3 + packssdw m7, m4 + packssdw m5, m0 + pshuflw m4, m7, q2301 + pshufhw m4, m4, q2301 + pshuflw m0, m5, q2301 + pshufhw m0, m0, q2301 + paddw m7, m4 + paddw m5, m0 + psrldq m7, 2 + psrldq m5, 2 + pshufd m7, m7, q3120 + pshufd m5, m5, q3120 + punpcklqdq m7, m5 + pshuflw m4, m1, q2301 + pshufhw m4, m4, q2301 + pshuflw m0, m7, q2301 + pshufhw m0, m0, q2301 + paddw m1, m4 + paddw m7, m0 + psrldq m1, 2 + psrldq m7, 2 + pshufd m1, m1, q3120 + pshufd m7, m7, q3120 + punpcklqdq m1, m7 +%endmacro + +%macro FILTER_H8_W4_sse2 0 + movh m1, [r0 + x - 3] + movh m0, [r0 + x - 2] + punpcklbw m1, m6 + punpcklbw m0, m6 + movh m4, [r0 + x - 1] + movh m5, [r0 + x] + punpcklbw m4, m6 + punpcklbw m5, m6 + pmaddwd m1, m3 + pmaddwd m0, m3 + pmaddwd m4, m3 + pmaddwd m5, m3 + packssdw m1, m0 + packssdw m4, m5 + pshuflw m0, m1, q2301 + pshufhw m0, m0, q2301 + pshuflw m5, m4, q2301 + pshufhw m5, m5, q2301 + paddw m1, m0 + paddw m4, m5 + psrldq m1, 2 + psrldq m4, 2 + pshufd m1, m1, q3120 + pshufd m4, m4, q3120 + punpcklqdq m1, m4 + pshuflw m0, m1, q2301 + pshufhw m0, m0, q2301 + paddw m1, m0 + psrldq m1, 2 + pshufd m1, m1, q3120 +%endmacro + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_LUMA_sse2 3 +INIT_XMM sse2 +cglobal interp_8tap_horiz_%3_%1x%2, 4,6,8 + mov r4d, r4m + add r4d, r4d + pxor m6, m6 + +%ifidn %3, ps + add r3d, r3d + cmp r5m, byte 0 +%endif + +%ifdef PIC + lea r5, [tabw_LumaCoeff] + movu m3, [r5 + r4 * 8] +%else + movu m3, [tabw_LumaCoeff + r4 * 8] +%endif + + mov r4d, %2 + +%ifidn %3, pp + mova m2, [pw_32] +%else + mova m2, [pw_2000] + je .loopH + lea r5, [r1 + 2 * r1] + sub r0, r5 + add r4d, 7 +%endif + +.loopH: +%assign x 0 +%rep %1 / 8 + FILTER_H8_W8_sse2 + %ifidn %3, pp + paddw m1, m2 + psraw m1, 6 + packuswb m1, m1 + movh [r2 + x], m1 + %else + psubw m1, m2 + movu [r2 + 2 * x], m1 + %endif +%assign x x+8 +%endrep + +%rep (%1 % 8) / 4 + FILTER_H8_W4_sse2 + %ifidn %3, pp + paddw m1, m2 + psraw m1, 6 + packuswb m1, m1 + movd [r2 + x], m1 + %else + psubw m1, m2 + movh [r2 + 2 * x], m1 + 
%endif +%endrep + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET + +%endmacro + +;-------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;-------------------------------------------------------------------------------------------------------------- + IPFILTER_LUMA_sse2 4, 4, pp + IPFILTER_LUMA_sse2 4, 8, pp + IPFILTER_LUMA_sse2 8, 4, pp + IPFILTER_LUMA_sse2 8, 8, pp + IPFILTER_LUMA_sse2 16, 16, pp + IPFILTER_LUMA_sse2 16, 8, pp + IPFILTER_LUMA_sse2 8, 16, pp + IPFILTER_LUMA_sse2 16, 12, pp + IPFILTER_LUMA_sse2 12, 16, pp + IPFILTER_LUMA_sse2 16, 4, pp + IPFILTER_LUMA_sse2 4, 16, pp + IPFILTER_LUMA_sse2 32, 32, pp + IPFILTER_LUMA_sse2 32, 16, pp + IPFILTER_LUMA_sse2 16, 32, pp + IPFILTER_LUMA_sse2 32, 24, pp + IPFILTER_LUMA_sse2 24, 32, pp + IPFILTER_LUMA_sse2 32, 8, pp + IPFILTER_LUMA_sse2 8, 32, pp + IPFILTER_LUMA_sse2 64, 64, pp + IPFILTER_LUMA_sse2 64, 32, pp + IPFILTER_LUMA_sse2 32, 64, pp + IPFILTER_LUMA_sse2 64, 48, pp + IPFILTER_LUMA_sse2 48, 64, pp + IPFILTER_LUMA_sse2 64, 16, pp + IPFILTER_LUMA_sse2 16, 64, pp + +;---------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;---------------------------------------------------------------------------------------------------------------------------- + IPFILTER_LUMA_sse2 4, 4, ps + IPFILTER_LUMA_sse2 8, 8, ps + IPFILTER_LUMA_sse2 8, 4, ps + IPFILTER_LUMA_sse2 4, 8, ps + IPFILTER_LUMA_sse2 16, 16, ps + IPFILTER_LUMA_sse2 16, 8, ps + IPFILTER_LUMA_sse2 8, 16, ps + IPFILTER_LUMA_sse2 16, 12, ps + IPFILTER_LUMA_sse2 12, 16, ps + IPFILTER_LUMA_sse2 16, 4, ps + IPFILTER_LUMA_sse2 4, 16, ps + IPFILTER_LUMA_sse2 32, 32, ps + IPFILTER_LUMA_sse2 32, 16, ps + IPFILTER_LUMA_sse2 16, 32, ps + IPFILTER_LUMA_sse2 32, 24, ps + IPFILTER_LUMA_sse2 24, 32, ps + IPFILTER_LUMA_sse2 32, 8, ps + IPFILTER_LUMA_sse2 8, 32, ps + IPFILTER_LUMA_sse2 64, 64, ps + IPFILTER_LUMA_sse2 64, 32, ps + IPFILTER_LUMA_sse2 32, 64, ps + IPFILTER_LUMA_sse2 64, 48, ps + IPFILTER_LUMA_sse2 48, 64, ps + IPFILTER_LUMA_sse2 64, 16, ps + IPFILTER_LUMA_sse2 16, 64, ps + +%macro WORD_TO_DOUBLE 1 +%if ARCH_X86_64 + punpcklbw %1, m8 +%else + punpcklbw %1, %1 + psrlw %1, 8 +%endif +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_2xn(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W2_H4_sse2 1 +INIT_XMM sse2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_pp_2x%1, 4, 6, 9 + pxor m8, m8 +%else +cglobal interp_4tap_vert_pp_2x%1, 4, 6, 8 +%endif + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movh m0, [r5 + r4 * 8] +%else + movh m0, [tabw_ChromaCoeff + r4 * 8] +%endif + + punpcklqdq m0, m0 + mova m1, [pw_32] + lea r5, [3 * r1] + +%assign x 1 +%rep %1/4 + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklwd m2, m6 + + WORD_TO_DOUBLE m2 + pmaddwd m2, m0 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklwd m3, m7 + + WORD_TO_DOUBLE m3 + pmaddwd m3, m0 + + packssdw m2, m3 + pshuflw m3, 
m2, q2301 + pshufhw m3, m3, q2301 + paddw m2, m3 + psrld m2, 16 + + movd m7, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklwd m4, m3 + + WORD_TO_DOUBLE m4 + pmaddwd m4, m0 + + movd m3, [r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklwd m5, m7 + + WORD_TO_DOUBLE m5 + pmaddwd m5, m0 + + packssdw m4, m5 + pshuflw m5, m4, q2301 + pshufhw m5, m5, q2301 + paddw m4, m5 + psrld m4, 16 + + packssdw m2, m4 + paddw m2, m1 + psraw m2, 6 + packuswb m2, m2 + +%if ARCH_X86_64 + movq r4, m2 + mov [r2], r4w + shr r4, 16 + mov [r2 + r3], r4w + lea r2, [r2 + 2 * r3] + shr r4, 16 + mov [r2], r4w + shr r4, 16 + mov [r2 + r3], r4w +%else + movd r4, m2 + mov [r2], r4w + shr r4, 16 + mov [r2 + r3], r4w + lea r2, [r2 + 2 * r3] + psrldq m2, 4 + movd r4, m2 + mov [r2], r4w + shr r4, 16 + mov [r2 + r3], r4w +%endif + +%if x < %1/4 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET + +%endmacro + + FILTER_V4_W2_H4_sse2 4 + FILTER_V4_W2_H4_sse2 8 + FILTER_V4_W2_H4_sse2 16 + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +INIT_XMM sse2 +cglobal interp_4tap_vert_pp_4x2, 4, 6, 8 + + mov r4d, r4m + sub r0, r1 + pxor m7, m7 + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movh m0, [r5 + r4 * 8] +%else + movh m0, [tabw_ChromaCoeff + r4 * 8] +%endif + + lea r5, [r0 + 2 * r1] + punpcklqdq m0, m0 + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r5] + movd m5, [r5 + r1] + + punpcklbw m2, m3 + punpcklbw m1, m4, m5 + punpcklwd m2, m1 + + movhlps m6, m2 + punpcklbw m2, m7 + punpcklbw m6, m7 + pmaddwd m2, m0 + pmaddwd m6, m0 + packssdw m2, m6 + + movd m1, [r0 + 4 * r1] + + punpcklbw m3, m4 + punpcklbw m5, m1 + punpcklwd m3, m5 + + movhlps m6, m3 + punpcklbw m3, m7 + punpcklbw m6, m7 + pmaddwd m3, m0 + pmaddwd m6, m0 + packssdw m3, m6 + + pshuflw m4, m2, q2301 + pshufhw m4, m4, q2301 + paddw m2, m4 + pshuflw m5, m3, q2301 + pshufhw m5, m5, q2301 + paddw m3, m5 + psrld m2, 16 + psrld m3, 16 + packssdw m2, m3 + + paddw m2, [pw_32] + psraw m2, 6 + packuswb m2, m2 + + movd [r2], m2 + psrldq m2, 4 + movd [r2 + r3], m2 + RET + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W4_H4_sse2 1 +INIT_XMM sse2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_pp_4x%1, 4, 6, 9 + pxor m8, m8 +%else +cglobal interp_4tap_vert_pp_4x%1, 4, 6, 8 +%endif + + mov r4d, r4m + sub r0, r1 + +%ifdef PIC + lea r5, [tabw_ChromaCoeff] + movh m0, [r5 + r4 * 8] +%else + movh m0, [tabw_ChromaCoeff + r4 * 8] +%endif + + mova m1, [pw_32] + lea r5, [3 * r1] + punpcklqdq m0, m0 + +%assign x 1 +%rep %1/4 + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] + + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklwd m2, m6 + + movhlps m6, m2 + WORD_TO_DOUBLE m2 + WORD_TO_DOUBLE m6 + pmaddwd m2, m0 + pmaddwd m6, m0 + packssdw m2, m6 + + lea r0, [r0 + 4 * r1] + movd m6, [r0] + + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklwd m3, m7 + + movhlps m7, m3 + WORD_TO_DOUBLE m3 + WORD_TO_DOUBLE m7 + pmaddwd m3, m0 + pmaddwd m7, m0 + packssdw m3, m7 + + pshuflw m7, m2, q2301 + pshufhw m7, m7, q2301 + paddw m2, m7 + pshuflw m7, m3, q2301 + pshufhw m7, m7, q2301 + 
paddw m3, m7 + psrld m2, 16 + psrld m3, 16 + packssdw m2, m3 + + paddw m2, m1 + psraw m2, 6 + + movd m7, [r0 + r1] + + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklwd m4, m3 + + movhlps m3, m4 + WORD_TO_DOUBLE m4 + WORD_TO_DOUBLE m3 + pmaddwd m4, m0 + pmaddwd m3, m0 + packssdw m4, m3 + + movd m3, [r0 + 2 * r1] + + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklwd m5, m7 + + movhlps m3, m5 + WORD_TO_DOUBLE m5 + WORD_TO_DOUBLE m3 + pmaddwd m5, m0 + pmaddwd m3, m0 + packssdw m5, m3 + + pshuflw m7, m4, q2301 + pshufhw m7, m7, q2301 + paddw m4, m7 + pshuflw m7, m5, q2301 + pshufhw m7, m7, q2301 + paddw m5, m7 + psrld m4, 16 + psrld m5, 16 + packssdw m4, m5 + + paddw m4, m1 + psraw m4, 6 + packuswb m2, m4 + + movd [r2], m2 + psrldq m2, 4 + movd [r2 + r3], m2 + lea r2, [r2 + 2 * r3] + psrldq m2, 4 + movd [r2], m2 + psrldq m2, 4 + movd [r2 + r3], m2 + +%if x < %1/4 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET +%endmacro + + FILTER_V4_W4_H4_sse2 4 + FILTER_V4_W4_H4_sse2 8 + FILTER_V4_W4_H4_sse2 16 + FILTER_V4_W4_H4_sse2 32 + +;----------------------------------------------------------------------------- +;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W6_H4_sse2 1 +INIT_XMM sse2 +cglobal interp_4tap_vert_pp_6x%1, 4, 7, 10 + + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m6, [r5 + r4] + mova m5, [r5 + r4 + 16] +%else + mova m6, [tab_ChromaCoeffV + r4] + mova m5, [tab_ChromaCoeffV + r4 + 16] +%endif + + mova m4, [pw_32] + lea r5, [3 * r1] + +%assign x 1 +%rep %1/4 + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + movhlps m7, m0 + punpcklbw m0, m9 + punpcklbw m7, m9 + pmaddwd m0, m6 + pmaddwd m7, m6 + packssdw m0, m7 + + movhlps m8, m2 + movq m7, m2 + punpcklbw m8, m9 + punpcklbw m7, m9 + pmaddwd m8, m5 + pmaddwd m7, m5 + packssdw m7, m8 + + paddw m0, m7 + + paddw m0, m4 + psraw m0, 6 + packuswb m0, m0 + movd [r2], m0 + pextrw r6d, m0, 2 + mov [r2 + 4], r6w + + lea r0, [r0 + 4 * r1] + + movq m0, [r0] + punpcklbw m3, m0 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m6 + pmaddwd m8, m6 + packssdw m1, m8 + + movhlps m8, m3 + movq m7, m3 + punpcklbw m8, m9 + punpcklbw m7, m9 + pmaddwd m8, m5 + pmaddwd m7, m5 + packssdw m7, m8 + + paddw m1, m7 + + paddw m1, m4 + psraw m1, 6 + packuswb m1, m1 + movd [r2 + r3], m1 + pextrw r6d, m1, 2 + mov [r2 + r3 + 4], r6w + movq m1, [r0 + r1] + punpcklbw m7, m0, m1 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m6 + pmaddwd m8, m6 + packssdw m2, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m2, m7 + + paddw m2, m4 + psraw m2, 6 + packuswb m2, m2 + lea r2, [r2 + 2 * r3] + movd [r2], m2 + pextrw r6d, m2, 2 + mov [r2 + 4], r6w + + movq m2, [r0 + 2 * r1] + punpcklbw m1, m2 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m6 + pmaddwd m8, m6 + packssdw m3, m8 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m5 + pmaddwd m8, m5 + packssdw m1, m8 + + paddw m3, m1 + + paddw m3, m4 + psraw m3, 6 + packuswb m3, m3 + + movd [r2 + r3], m3 + pextrw r6d, m3, 2 + mov [r2 + r3 + 4], r6w + +%if x < %1/4 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET + +%endmacro + +%if ARCH_X86_64 + 
FILTER_V4_W6_H4_sse2 8 + FILTER_V4_W6_H4_sse2 16 +%endif + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W8_sse2 1 +INIT_XMM sse2 +cglobal interp_4tap_vert_pp_8x%1, 4, 7, 12 + + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + mova m4, [pw_32] + +%ifdef PIC + lea r6, [tab_ChromaCoeffV] + mova m6, [r6 + r4] + mova m5, [r6 + r4 + 16] +%else + mova m6, [tab_ChromaCoeffV + r4] + mova m5, [tab_ChromaCoeffV + r4 + 16] +%endif + + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + lea r5, [r0 + 2 * r1] + movq m3, [r5 + r1] + + punpcklbw m0, m1 + punpcklbw m7, m2, m3 + + movhlps m8, m0 + punpcklbw m0, m9 + punpcklbw m8, m9 + pmaddwd m0, m6 + pmaddwd m8, m6 + packssdw m0, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m0, m7 + + paddw m0, m4 + psraw m0, 6 + + movq m11, [r0 + 4 * r1] + + punpcklbw m1, m2 + punpcklbw m7, m3, m11 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m6 + pmaddwd m8, m6 + packssdw m1, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m1, m7 + + paddw m1, m4 + psraw m1, 6 + packuswb m1, m0 + + movhps [r2], m1 + movh [r2 + r3], m1 +%if %1 == 2 ;end of 8x2 + RET + +%else + lea r6, [r0 + 4 * r1] + movq m1, [r6 + r1] + + punpcklbw m2, m3 + punpcklbw m7, m11, m1 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m6 + pmaddwd m8, m6 + packssdw m2, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m2, m7 + + paddw m2, m4 + psraw m2, 6 + + movq m10, [r6 + 2 * r1] + + punpcklbw m3, m11 + punpcklbw m7, m1, m10 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m6 + pmaddwd m8, m6 + packssdw m3, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m3, m7 + + paddw m3, m4 + psraw m3, 6 + packuswb m3, m2 + + movhps [r2 + 2 * r3], m3 + lea r5, [r2 + 2 * r3] + movh [r5 + r3], m3 +%if %1 == 4 ;end of 8x4 + RET + +%else + lea r6, [r6 + 2 * r1] + movq m3, [r6 + r1] + + punpcklbw m11, m1 + punpcklbw m7, m10, m3 + + movhlps m8, m11 + punpcklbw m11, m9 + punpcklbw m8, m9 + pmaddwd m11, m6 + pmaddwd m8, m6 + packssdw m11, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m5 + pmaddwd m8, m5 + packssdw m7, m8 + + paddw m11, m7 + + paddw m11, m4 + psraw m11, 6 + + movq m7, [r0 + 8 * r1] + + punpcklbw m1, m10 + punpcklbw m3, m7 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m6 + pmaddwd m8, m6 + packssdw m1, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m5 + pmaddwd m8, m5 + packssdw m3, m8 + + paddw m1, m3 + + paddw m1, m4 + psraw m1, 6 + packuswb m1, m11 + + movhps [r2 + 4 * r3], m1 + lea r5, [r2 + 4 * r3] + movh [r5 + r3], m1 +%if %1 == 6 + RET + +%else + %error INVALID macro argument, only 2, 4 or 6! 
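+    ; Assembly-time guard: this fully unrolled kernel only covers the
+    ; 8x2, 8x4 and 8x6 block sizes (%1 = 2, 4 or 6, each exiting through
+    ; its own RET above); taller 8xN sizes are handled by the looping
+    ; FILTER_V4_W8_H8_H16_H32_sse2 macro that follows.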
+%endif +%endif +%endif +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W8_sse2 2 + FILTER_V4_W8_sse2 4 + FILTER_V4_W8_sse2 6 +%endif + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W8_H8_H16_H32_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 11 + + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m6, [r5 + r4] + mova m5, [r5 + r4 + 16] +%else + mova m6, [tab_ChromaCoeff + r4] + mova m5, [tab_ChromaCoeff + r4 + 16] +%endif + + mova m4, [pw_32] + lea r5, [r1 * 3] + +%assign x 1 +%rep %2/4 + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] + + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 + + movhlps m7, m0 + punpcklbw m0, m9 + punpcklbw m7, m9 + pmaddwd m0, m6 + pmaddwd m7, m6 + packssdw m0, m7 + + movhlps m8, m2 + movq m7, m2 + punpcklbw m8, m9 + punpcklbw m7, m9 + pmaddwd m8, m5 + pmaddwd m7, m5 + packssdw m7, m8 + + paddw m0, m7 + paddw m0, m4 + psraw m0, 6 + + lea r0, [r0 + 4 * r1] + movq m10, [r0] + punpcklbw m3, m10 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m6 + pmaddwd m8, m6 + packssdw m1, m8 + + movhlps m8, m3 + movq m7, m3 + punpcklbw m8, m9 + punpcklbw m7, m9 + pmaddwd m8, m5 + pmaddwd m7, m5 + packssdw m7, m8 + + paddw m1, m7 + paddw m1, m4 + psraw m1, 6 + + packuswb m0, m1 + movh [r2], m0 + movhps [r2 + r3], m0 + + movq m1, [r0 + r1] + punpcklbw m10, m1 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m6 + pmaddwd m8, m6 + packssdw m2, m8 + + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m5 + pmaddwd m8, m5 + packssdw m10, m8 + + paddw m2, m10 + paddw m2, m4 + psraw m2, 6 + + movq m7, [r0 + 2 * r1] + punpcklbw m1, m7 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m6 + pmaddwd m8, m6 + packssdw m3, m8 + + movhlps m8, m1 + punpcklbw m1, m9 + punpcklbw m8, m9 + pmaddwd m1, m5 + pmaddwd m8, m5 + packssdw m1, m8 + + paddw m3, m1 + paddw m3, m4 + psraw m3, 6 + + packuswb m2, m3 + lea r2, [r2 + 2 * r3] + movh [r2], m2 + movhps [r2 + r3], m2 +%if x < %2/4 + lea r2, [r2 + 2 * r3] +%endif +%endrep + RET +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W8_H8_H16_H32_sse2 8, 8 + FILTER_V4_W8_H8_H16_H32_sse2 8, 16 + FILTER_V4_W8_H8_H16_H32_sse2 8, 32 + + FILTER_V4_W8_H8_H16_H32_sse2 8, 12 + FILTER_V4_W8_H8_H16_H32_sse2 8, 64 +%endif + ;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- @@ -328,26 +1887,26 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] %rep 2 -FILTER_H4_w2_2 t0, t1, t2 -lea srcq, [srcq + srcstrideq * 2] -lea dstq, [dstq + dststrideq * 2] + FILTER_H4_w2_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] %endrep -RET + RET 
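+
+; The FILTER_H4_w2_2/FILTER_H4_w4_2 kernels in this block all share one
+; rounding idiom: the 4-tap sum fits in a signed 16-bit lane, and
+; pmulhrsw against pw_512 computes (sum * 512 + 0x4000) >> 15, which is
+; exactly the HEVC chroma rounding (sum + 32) >> 6, so the round and
+; shift cost a single instruction before packuswb saturates to 8 bits.
+; A scalar sketch of one output row (hypothetical reference, not code
+; from this file; clip3() stands in for the saturation that packuswb
+; performs, and c[] for the selected tab_ChromaCoeff row):
+;
+;     for (int x = 0; x < width; x++)
+;     {
+;         int sum = c[0] * src[x - 1] + c[1] * src[x]
+;                 + c[2] * src[x + 1] + c[3] * src[x + 2];
+;         dst[x] = (pixel)clip3(0, 255, (sum + 32) >> 6);
+;     }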
;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -360,26 +1919,26 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] %rep 4 -FILTER_H4_w2_2 t0, t1, t2 -lea srcq, [srcq + srcstrideq * 2] -lea dstq, [dstq + dststrideq * 2] + FILTER_H4_w2_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] %endrep -RET + RET ;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -392,29 +1951,29 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] -mov r5d, 16/2 + mov r5d, 16/2 .loop: -FILTER_H4_w2_2 t0, t1, t2 -lea srcq, [srcq + srcstrideq * 2] -lea dstq, [dstq + dststrideq * 2] -dec r5d -jnz .loop + FILTER_H4_w2_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] + dec r5d + jnz .loop -RET + RET %macro FILTER_H4_w4_2 3 movh %2, [srcq - 1] @@ -442,22 +2001,22 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] -FILTER_H4_w4_2 t0, t1, t2 + FILTER_H4_w4_2 t0, t1, t2 -RET + RET ;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -470,26 +2029,26 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] %rep 2 -FILTER_H4_w4_2 t0, t1, t2 -lea srcq, [srcq + srcstrideq * 2] -lea dstq, [dstq + dststrideq * 2] + FILTER_H4_w4_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] %endrep -RET + RET ;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -502,26 +2061,26 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd 
coef2, [tab_ChromaCoeff + r4 * 4] %endif -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] %rep 4 -FILTER_H4_w4_2 t0, t1, t2 -lea srcq, [srcq + srcstrideq * 2] -lea dstq, [dstq + dststrideq * 2] + FILTER_H4_w4_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] %endrep -RET + RET ;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -534,26 +2093,26 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] %rep 8 -FILTER_H4_w4_2 t0, t1, t2 -lea srcq, [srcq + srcstrideq * 2] -lea dstq, [dstq + dststrideq * 2] + FILTER_H4_w4_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] %endrep -RET + RET ;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -566,29 +2125,29 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] -mov r5d, 32/2 + mov r5d, 32/2 .loop: -FILTER_H4_w4_2 t0, t1, t2 -lea srcq, [srcq + srcstrideq * 2] -lea dstq, [dstq + dststrideq * 2] -dec r5d -jnz .loop + FILTER_H4_w4_2 t0, t1, t2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] + dec r5d + jnz .loop -RET + RET ALIGN 32 const interp_4tap_8x8_horiz_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 @@ -764,47 +2323,47 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -mov r5d, %2 + mov r5d, %2 -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] -mova Tm1, [tab_Tm + 16] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + mova Tm1, [tab_Tm + 16] .loop: -FILTER_H4_w%1 t0, t1, t2 -add srcq, srcstrideq -add dstq, dststrideq - -dec r5d -jnz .loop - -RET + FILTER_H4_w%1 t0, t1, t2 + add srcq, srcstrideq + add dstq, dststrideq + + dec r5d + jnz .loop + + RET %endmacro -IPFILTER_CHROMA 6, 8 -IPFILTER_CHROMA 8, 2 -IPFILTER_CHROMA 8, 4 -IPFILTER_CHROMA 8, 6 -IPFILTER_CHROMA 8, 8 -IPFILTER_CHROMA 8, 16 -IPFILTER_CHROMA 8, 32 -IPFILTER_CHROMA 12, 16 - -IPFILTER_CHROMA 6, 16 -IPFILTER_CHROMA 8, 12 -IPFILTER_CHROMA 8, 64 -IPFILTER_CHROMA 12, 32 + IPFILTER_CHROMA 6, 8 + IPFILTER_CHROMA 8, 2 + IPFILTER_CHROMA 8, 4 + IPFILTER_CHROMA 8, 6 + IPFILTER_CHROMA 8, 8 + IPFILTER_CHROMA 8, 16 + IPFILTER_CHROMA 8, 32 + IPFILTER_CHROMA 12, 16 + + IPFILTER_CHROMA 6, 16 + IPFILTER_CHROMA 8, 12 + IPFILTER_CHROMA 8, 64 + IPFILTER_CHROMA 12, 32 
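+
+; Each IPFILTER_CHROMA W, H line above emits a standalone
+; interp_4tap_horiz_pp_WxH entry point from the same looping body, one
+; source row per iteration. Widths 6, 8 and 12 fit this template; the
+; 16 to 64 wide sizes are produced by IPFILTER_CHROMA_W below, whose
+; row kernels take the additional t3 temporary.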
;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -820,55 +2379,55 @@ %define t1 m1 %define t0 m0 -mov r4d, r4m + mov r4d, r4m %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd coef2, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd coef2, [r5 + r4 * 4] %else -movd coef2, [tab_ChromaCoeff + r4 * 4] + movd coef2, [tab_ChromaCoeff + r4 * 4] %endif -mov r5d, %2 + mov r5d, %2 -pshufd coef2, coef2, 0 -mova t2, [pw_512] -mova Tm0, [tab_Tm] -mova Tm1, [tab_Tm + 16] + pshufd coef2, coef2, 0 + mova t2, [pw_512] + mova Tm0, [tab_Tm] + mova Tm1, [tab_Tm + 16] .loop: -FILTER_H4_w%1 t0, t1, t2, t3 -add srcq, srcstrideq -add dstq, dststrideq - -dec r5d -jnz .loop - -RET -%endmacro - -IPFILTER_CHROMA_W 16, 4 -IPFILTER_CHROMA_W 16, 8 -IPFILTER_CHROMA_W 16, 12 -IPFILTER_CHROMA_W 16, 16 -IPFILTER_CHROMA_W 16, 32 -IPFILTER_CHROMA_W 32, 8 -IPFILTER_CHROMA_W 32, 16 -IPFILTER_CHROMA_W 32, 24 -IPFILTER_CHROMA_W 24, 32 -IPFILTER_CHROMA_W 32, 32 - -IPFILTER_CHROMA_W 16, 24 -IPFILTER_CHROMA_W 16, 64 -IPFILTER_CHROMA_W 32, 48 -IPFILTER_CHROMA_W 24, 64 -IPFILTER_CHROMA_W 32, 64 - -IPFILTER_CHROMA_W 64, 64 -IPFILTER_CHROMA_W 64, 32 -IPFILTER_CHROMA_W 64, 48 -IPFILTER_CHROMA_W 48, 64 -IPFILTER_CHROMA_W 64, 16 + FILTER_H4_w%1 t0, t1, t2, t3 + add srcq, srcstrideq + add dstq, dststrideq + + dec r5d + jnz .loop + + RET +%endmacro + + IPFILTER_CHROMA_W 16, 4 + IPFILTER_CHROMA_W 16, 8 + IPFILTER_CHROMA_W 16, 12 + IPFILTER_CHROMA_W 16, 16 + IPFILTER_CHROMA_W 16, 32 + IPFILTER_CHROMA_W 32, 8 + IPFILTER_CHROMA_W 32, 16 + IPFILTER_CHROMA_W 32, 24 + IPFILTER_CHROMA_W 24, 32 + IPFILTER_CHROMA_W 32, 32 + + IPFILTER_CHROMA_W 16, 24 + IPFILTER_CHROMA_W 16, 64 + IPFILTER_CHROMA_W 32, 48 + IPFILTER_CHROMA_W 24, 64 + IPFILTER_CHROMA_W 32, 64 + + IPFILTER_CHROMA_W 64, 64 + IPFILTER_CHROMA_W 64, 32 + IPFILTER_CHROMA_W 64, 48 + IPFILTER_CHROMA_W 48, 64 + IPFILTER_CHROMA_W 64, 16 %macro FILTER_H8_W8 7-8 ; t0, t1, t2, t3, coef, c512, src, dst @@ -918,7 +2477,7 @@ %endif punpcklqdq m3, m3 -%ifidn %3, pp +%ifidn %3, pp mova m2, [pw_512] %else mova m2, [pw_2000] @@ -937,7 +2496,7 @@ .loopH: xor r5, r5 %rep %1 / 8 - %ifidn %3, pp + %ifidn %3, pp FILTER_H8_W8 m0, m1, m4, m5, m3, m2, [r0 - 3 + r5], [r2 + r5] %else FILTER_H8_W8 m0, m1, m4, m5, m3, UNUSED, [r0 - 3 + r5] @@ -949,7 +2508,7 @@ %rep (%1 % 8) / 4 FILTER_H8_W4 m0, m1 - %ifidn %3, pp + %ifidn %3, pp pmulhrsw m1, m2 packuswb m1, m1 movd [r2 + r5], m1 @@ -1120,8 +2679,8 @@ %endif %endmacro -FILTER_HORIZ_LUMA_AVX2_4xN 8 -FILTER_HORIZ_LUMA_AVX2_4xN 16 + FILTER_HORIZ_LUMA_AVX2_4xN 8 + FILTER_HORIZ_LUMA_AVX2_4xN 16 INIT_YMM avx2 cglobal interp_8tap_horiz_pp_8x4, 4, 6, 7 @@ -1271,9 +2830,9 @@ RET %endmacro -IPFILTER_LUMA_AVX2_8xN 8, 8 -IPFILTER_LUMA_AVX2_8xN 8, 16 -IPFILTER_LUMA_AVX2_8xN 8, 32 + IPFILTER_LUMA_AVX2_8xN 8, 8 + IPFILTER_LUMA_AVX2_8xN 8, 16 + IPFILTER_LUMA_AVX2_8xN 8, 32 %macro IPFILTER_LUMA_AVX2 2 INIT_YMM avx2 @@ -1306,7 +2865,7 @@ pmaddubsw m5, m1 paddw m4, m5 pmaddwd m4, m7 - vbroadcasti128 m5, [r0 + 8] ; second 8 elements in Row0 + vbroadcasti128 m5, [r0 + 8] ; second 8 elements in Row0 pshufb m6, m5, m3 pshufb m5, [tab_Tm] pmaddubsw m5, m0 @@ -1322,7 +2881,7 @@ pmaddubsw m5, m1 paddw m2, m5 pmaddwd m2, m7 - vbroadcasti128 m5, [r0 + r1 + 8] ; second 8 elements in Row0 + vbroadcasti128 m5, [r0 + r1 + 8] ; second 8 elements in Row0 pshufb m6, m5, m3 pshufb m5, [tab_Tm] pmaddubsw m5, m0 @@ -1617,7 +3176,7 @@ jnz .loop RET -INIT_YMM avx2 
+INIT_YMM avx2 cglobal interp_4tap_horiz_pp_4x4, 4,6,6 mov r4d, r4m @@ -1665,7 +3224,7 @@ pextrd [r2+r0], xm3, 3 RET -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_pp_2x4, 4, 6, 3 mov r4d, r4m @@ -1698,7 +3257,7 @@ pextrw [r2 + r4], xm1, 3 RET -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6 mov r4d, r4m @@ -1941,7 +3500,7 @@ IPFILTER_LUMA_AVX2 16, 4 IPFILTER_LUMA_AVX2 16, 8 - IPFILTER_LUMA_AVX2 16, 12 + IPFILTER_LUMA_AVX2 16, 12 IPFILTER_LUMA_AVX2 16, 16 IPFILTER_LUMA_AVX2 16, 32 IPFILTER_LUMA_AVX2 16, 64 @@ -2144,6 +3703,108 @@ RET ;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; +%macro IPFILTER_CHROMA_HPS_64xN 1 +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_64x%1, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, %1 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2 + 32], m3 + + vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 40] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2 + 64], m3 + + vbroadcasti128 m3, [r0 + 48] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 56] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2 + 96], m3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_HPS_64xN 64 + IPFILTER_CHROMA_HPS_64xN 32 + IPFILTER_CHROMA_HPS_64xN 48 + IPFILTER_CHROMA_HPS_64xN 16 + +;----------------------------------------------------------------------------------------------------------------------------- ;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;----------------------------------------------------------------------------------------------------------------------------- @@ -2230,7 +3891,7 @@ pshufb m4, m1 pmaddubsw m4, m0 phaddw m4, m4 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] - phaddw m3, m4 + phaddw m3, m4 vpermd m3, m5, m3 ; m5 don't broken in above psubw m3, 
m2 @@ -2312,7 +3973,7 @@ lea r2, [r2 + r3 * 2] ; first loop dst ->5th row(i.e 4) sub r5d, 2 jg .loop - jz .end + jz .end ; last row movu xm1, [r0] @@ -2334,10 +3995,10 @@ %endif %endmacro ; IPFILTER_LUMA_PS_8xN_AVX2 -IPFILTER_LUMA_PS_8xN_AVX2 4 -IPFILTER_LUMA_PS_8xN_AVX2 8 -IPFILTER_LUMA_PS_8xN_AVX2 16 -IPFILTER_LUMA_PS_8xN_AVX2 32 + IPFILTER_LUMA_PS_8xN_AVX2 4 + IPFILTER_LUMA_PS_8xN_AVX2 8 + IPFILTER_LUMA_PS_8xN_AVX2 16 + IPFILTER_LUMA_PS_8xN_AVX2 32 %macro IPFILTER_LUMA_PS_16x_AVX2 2 @@ -2399,17 +4060,17 @@ dec r9d jnz .label -RET + RET %endif %endmacro -IPFILTER_LUMA_PS_16x_AVX2 16 , 16 -IPFILTER_LUMA_PS_16x_AVX2 16 , 8 -IPFILTER_LUMA_PS_16x_AVX2 16 , 12 -IPFILTER_LUMA_PS_16x_AVX2 16 , 4 -IPFILTER_LUMA_PS_16x_AVX2 16 , 32 -IPFILTER_LUMA_PS_16x_AVX2 16 , 64 + IPFILTER_LUMA_PS_16x_AVX2 16 , 16 + IPFILTER_LUMA_PS_16x_AVX2 16 , 8 + IPFILTER_LUMA_PS_16x_AVX2 16 , 12 + IPFILTER_LUMA_PS_16x_AVX2 16 , 4 + IPFILTER_LUMA_PS_16x_AVX2 16 , 32 + IPFILTER_LUMA_PS_16x_AVX2 16 , 64 ;-------------------------------------------------------------------------------------------------------------- @@ -2460,27 +4121,27 @@ RET %endmacro -IPFILTER_LUMA_PP_W8 8, 4 -IPFILTER_LUMA_PP_W8 8, 8 -IPFILTER_LUMA_PP_W8 8, 16 -IPFILTER_LUMA_PP_W8 8, 32 -IPFILTER_LUMA_PP_W8 16, 4 -IPFILTER_LUMA_PP_W8 16, 8 -IPFILTER_LUMA_PP_W8 16, 12 -IPFILTER_LUMA_PP_W8 16, 16 -IPFILTER_LUMA_PP_W8 16, 32 -IPFILTER_LUMA_PP_W8 16, 64 -IPFILTER_LUMA_PP_W8 24, 32 -IPFILTER_LUMA_PP_W8 32, 8 -IPFILTER_LUMA_PP_W8 32, 16 -IPFILTER_LUMA_PP_W8 32, 24 -IPFILTER_LUMA_PP_W8 32, 32 -IPFILTER_LUMA_PP_W8 32, 64 -IPFILTER_LUMA_PP_W8 48, 64 -IPFILTER_LUMA_PP_W8 64, 16 -IPFILTER_LUMA_PP_W8 64, 32 -IPFILTER_LUMA_PP_W8 64, 48 -IPFILTER_LUMA_PP_W8 64, 64 + IPFILTER_LUMA_PP_W8 8, 4 + IPFILTER_LUMA_PP_W8 8, 8 + IPFILTER_LUMA_PP_W8 8, 16 + IPFILTER_LUMA_PP_W8 8, 32 + IPFILTER_LUMA_PP_W8 16, 4 + IPFILTER_LUMA_PP_W8 16, 8 + IPFILTER_LUMA_PP_W8 16, 12 + IPFILTER_LUMA_PP_W8 16, 16 + IPFILTER_LUMA_PP_W8 16, 32 + IPFILTER_LUMA_PP_W8 16, 64 + IPFILTER_LUMA_PP_W8 24, 32 + IPFILTER_LUMA_PP_W8 32, 8 + IPFILTER_LUMA_PP_W8 32, 16 + IPFILTER_LUMA_PP_W8 32, 24 + IPFILTER_LUMA_PP_W8 32, 32 + IPFILTER_LUMA_PP_W8 32, 64 + IPFILTER_LUMA_PP_W8 48, 64 + IPFILTER_LUMA_PP_W8 64, 16 + IPFILTER_LUMA_PP_W8 64, 32 + IPFILTER_LUMA_PP_W8 64, 48 + IPFILTER_LUMA_PP_W8 64, 64 ;---------------------------------------------------------------------------------------------------------------------------- ; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) @@ -2547,10 +4208,10 @@ ; Round and Saturate %macro FILTER_HV8_END 4 ; output in [1, 3] - paddd %1, [tab_c_526336] - paddd %2, [tab_c_526336] - paddd %3, [tab_c_526336] - paddd %4, [tab_c_526336] + paddd %1, [pd_526336] + paddd %2, [pd_526336] + paddd %3, [pd_526336] + paddd %4, [pd_526336] psrad %1, 12 psrad %2, 12 psrad %3, 12 @@ -2565,7 +4226,7 @@ ;----------------------------------------------------------------------------- ; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY) ;----------------------------------------------------------------------------- -INIT_XMM sse4 +INIT_XMM ssse3 cglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16 %define coef m7 %define stk_buf rsp @@ -2640,76 +4301,148 @@ RET ;----------------------------------------------------------------------------- +; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY) 
+;----------------------------------------------------------------------------- +INIT_XMM sse3 +cglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16 + mov r4d, r4m + mov r5d, r5m + add r4d, r4d + pxor m6, m6 + +%ifdef PIC + lea r6, [tabw_LumaCoeff] + mova m3, [r6 + r4 * 8] +%else + mova m3, [tabw_LumaCoeff + r4 * 8] +%endif + + ; move to row -3 + lea r6, [r1 + r1 * 2] + sub r0, r6 + + mov r4, rsp + +%assign x 0 ;needed for FILTER_H8_W8_sse2 macro +%assign y 1 +%rep 15 + FILTER_H8_W8_sse2 + psubw m1, [pw_2000] + mova [r4], m1 + +%if y < 15 + add r0, r1 + add r4, 16 +%endif +%assign y y+1 +%endrep + + ; ready to phase V + ; Here all of mN is free + + ; load coeff table + shl r5, 6 + lea r6, [tab_LumaCoeffV] + lea r5, [r5 + r6] + + ; load intermedia buffer + mov r0, rsp + + ; register mapping + ; r0 - src + ; r5 - coeff + + ; let's go +%assign y 1 +%rep 4 + FILTER_HV8_START m1, m2, m3, m4, m0, 0, 0 + FILTER_HV8_MID m6, m2, m3, m4, m0, m1, m7, m5, 3, 1 + FILTER_HV8_MID m5, m6, m3, m4, m0, m1, m7, m2, 5, 2 + FILTER_HV8_MID m6, m5, m3, m4, m0, m1, m7, m2, 7, 3 + FILTER_HV8_END m3, m0, m4, m1 + + movh [r2], m3 + movhps [r2 + r3], m3 + +%if y < 4 + lea r0, [r0 + 16 * 2] + lea r2, [r2 + r3 * 2] +%endif +%assign y y+1 +%endrep + RET + +;----------------------------------------------------------------------------- ;void interp_4tap_vert_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- INIT_XMM sse4 cglobal interp_4tap_vert_pp_2x4, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -lea r4, [r1 * 3] -lea r5, [r0 + 4 * r1] -pshufb m0, [tab_Cm] -mova m1, [pw_512] + lea r4, [r1 * 3] + lea r5, [r0 + 4 * r1] + pshufb m0, [tab_Cm] + mova m1, [pw_512] -movd m2, [r0] -movd m3, [r0 + r1] -movd m4, [r0 + 2 * r1] -movd m5, [r0 + r4] + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r4] -punpcklbw m2, m3 -punpcklbw m6, m4, m5 -punpcklbw m2, m6 + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 -pmaddubsw m2, m0 + pmaddubsw m2, m0 -movd m6, [r5] + movd m6, [r5] -punpcklbw m3, m4 -punpcklbw m7, m5, m6 -punpcklbw m3, m7 + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 -pmaddubsw m3, m0 + pmaddubsw m3, m0 -phaddw m2, m3 + phaddw m2, m3 -pmulhrsw m2, m1 + pmulhrsw m2, m1 -movd m7, [r5 + r1] + movd m7, [r5 + r1] -punpcklbw m4, m5 -punpcklbw m3, m6, m7 -punpcklbw m4, m3 + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklbw m4, m3 -pmaddubsw m4, m0 + pmaddubsw m4, m0 -movd m3, [r5 + 2 * r1] + movd m3, [r5 + 2 * r1] -punpcklbw m5, m6 -punpcklbw m7, m3 -punpcklbw m5, m7 + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklbw m5, m7 -pmaddubsw m5, m0 + pmaddubsw m5, m0 -phaddw m4, m5 + phaddw m4, m5 -pmulhrsw m4, m1 -packuswb m2, m4 + pmulhrsw m4, m1 + packuswb m2, m4 -pextrw [r2], m2, 0 -pextrw [r2 + r3], m2, 2 -lea r2, [r2 + 2 * r3] -pextrw [r2], m2, 4 -pextrw [r2 + r3], m2, 6 + pextrw [r2], m2, 0 + pextrw [r2 + r3], m2, 2 + lea r2, [r2 + 2 * r3] + pextrw [r2], m2, 4 + pextrw [r2 + r3], m2, 6 -RET + RET %macro FILTER_VER_CHROMA_AVX2_2x4 1 INIT_YMM avx2 @@ -2762,8 +4495,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_2x4 pp -FILTER_VER_CHROMA_AVX2_2x4 ps + FILTER_VER_CHROMA_AVX2_2x4 pp + FILTER_VER_CHROMA_AVX2_2x4 ps %macro FILTER_VER_CHROMA_AVX2_2x8 1 INIT_YMM avx2 
@@ -2834,8 +4567,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_2x8 pp -FILTER_VER_CHROMA_AVX2_2x8 ps + FILTER_VER_CHROMA_AVX2_2x8 pp + FILTER_VER_CHROMA_AVX2_2x8 ps ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -2844,85 +4577,85 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_2x%2, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m0, [tab_Cm] + pshufb m0, [tab_Cm] -mova m1, [pw_512] + mova m1, [pw_512] -mov r4d, %2 -lea r5, [3 * r1] + mov r4d, %2 + lea r5, [3 * r1] .loop: -movd m2, [r0] -movd m3, [r0 + r1] -movd m4, [r0 + 2 * r1] -movd m5, [r0 + r5] + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] -punpcklbw m2, m3 -punpcklbw m6, m4, m5 -punpcklbw m2, m6 + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 -pmaddubsw m2, m0 + pmaddubsw m2, m0 -lea r0, [r0 + 4 * r1] -movd m6, [r0] + lea r0, [r0 + 4 * r1] + movd m6, [r0] -punpcklbw m3, m4 -punpcklbw m7, m5, m6 -punpcklbw m3, m7 + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 -pmaddubsw m3, m0 + pmaddubsw m3, m0 -phaddw m2, m3 + phaddw m2, m3 -pmulhrsw m2, m1 + pmulhrsw m2, m1 -movd m7, [r0 + r1] + movd m7, [r0 + r1] -punpcklbw m4, m5 -punpcklbw m3, m6, m7 -punpcklbw m4, m3 + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklbw m4, m3 -pmaddubsw m4, m0 + pmaddubsw m4, m0 -movd m3, [r0 + 2 * r1] + movd m3, [r0 + 2 * r1] -punpcklbw m5, m6 -punpcklbw m7, m3 -punpcklbw m5, m7 + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklbw m5, m7 -pmaddubsw m5, m0 + pmaddubsw m5, m0 -phaddw m4, m5 + phaddw m4, m5 -pmulhrsw m4, m1 -packuswb m2, m4 + pmulhrsw m4, m1 + packuswb m2, m4 -pextrw [r2], m2, 0 -pextrw [r2 + r3], m2, 2 -lea r2, [r2 + 2 * r3] -pextrw [r2], m2, 4 -pextrw [r2 + r3], m2, 6 + pextrw [r2], m2, 0 + pextrw [r2 + r3], m2, 2 + lea r2, [r2 + 2 * r3] + pextrw [r2], m2, 4 + pextrw [r2 + r3], m2, 6 -lea r2, [r2 + 2 * r3] + lea r2, [r2 + 2 * r3] -sub r4, 4 -jnz .loop -RET + sub r4, 4 + jnz .loop + RET %endmacro -FILTER_V4_W2_H4 2, 8 + FILTER_V4_W2_H4 2, 8 -FILTER_V4_W2_H4 2, 16 + FILTER_V4_W2_H4 2, 16 ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -2930,46 +4663,46 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_4x2, 4, 6, 6 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m0, [tab_Cm] -lea r5, [r0 + 2 * r1] + pshufb m0, [tab_Cm] + lea r5, [r0 + 2 * r1] -movd m2, [r0] -movd m3, [r0 + r1] -movd m4, [r5] -movd m5, [r5 + r1] + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r5] + movd m5, [r5 + r1] -punpcklbw m2, m3 -punpcklbw m1, m4, m5 -punpcklbw m2, m1 + punpcklbw m2, m3 + punpcklbw m1, m4, m5 + punpcklbw m2, m1 -pmaddubsw m2, m0 + pmaddubsw m2, m0 -movd m1, [r0 + 4 * r1] + movd m1, [r0 + 4 * r1] -punpcklbw m3, m4 -punpcklbw m5, m1 -punpcklbw m3, m5 + punpcklbw m3, m4 + punpcklbw m5, m1 + punpcklbw m3, m5 -pmaddubsw m3, m0 + pmaddubsw m3, m0 -phaddw m2, m3 + phaddw m2, m3 -pmulhrsw m2, [pw_512] 
-packuswb m2, m2 -movd [r2], m2 -pextrd [r2 + r3], m2, 1 + pmulhrsw m2, [pw_512] + packuswb m2, m2 + movd [r2], m2 + pextrd [r2 + r3], m2, 1 -RET + RET %macro FILTER_VER_CHROMA_AVX2_4x2 1 INIT_YMM avx2 @@ -3017,8 +4750,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_4x2 pp -FILTER_VER_CHROMA_AVX2_4x2 ps + FILTER_VER_CHROMA_AVX2_4x2 pp + FILTER_VER_CHROMA_AVX2_4x2 ps ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -3026,71 +4759,71 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_4x4, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m0, [tab_Cm] -mova m1, [pw_512] -lea r5, [r0 + 4 * r1] -lea r4, [r1 * 3] + pshufb m0, [tab_Cm] + mova m1, [pw_512] + lea r5, [r0 + 4 * r1] + lea r4, [r1 * 3] -movd m2, [r0] -movd m3, [r0 + r1] -movd m4, [r0 + 2 * r1] -movd m5, [r0 + r4] + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r4] -punpcklbw m2, m3 -punpcklbw m6, m4, m5 -punpcklbw m2, m6 + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 -pmaddubsw m2, m0 + pmaddubsw m2, m0 -movd m6, [r5] + movd m6, [r5] -punpcklbw m3, m4 -punpcklbw m7, m5, m6 -punpcklbw m3, m7 + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 -pmaddubsw m3, m0 + pmaddubsw m3, m0 -phaddw m2, m3 + phaddw m2, m3 -pmulhrsw m2, m1 + pmulhrsw m2, m1 -movd m7, [r5 + r1] + movd m7, [r5 + r1] -punpcklbw m4, m5 -punpcklbw m3, m6, m7 -punpcklbw m4, m3 + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklbw m4, m3 -pmaddubsw m4, m0 + pmaddubsw m4, m0 -movd m3, [r5 + 2 * r1] + movd m3, [r5 + 2 * r1] -punpcklbw m5, m6 -punpcklbw m7, m3 -punpcklbw m5, m7 + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklbw m5, m7 -pmaddubsw m5, m0 + pmaddubsw m5, m0 -phaddw m4, m5 + phaddw m4, m5 -pmulhrsw m4, m1 + pmulhrsw m4, m1 -packuswb m2, m4 -movd [r2], m2 -pextrd [r2 + r3], m2, 1 -lea r2, [r2 + 2 * r3] -pextrd [r2], m2, 2 -pextrd [r2 + r3], m2, 3 -RET + packuswb m2, m4 + movd [r2], m2 + pextrd [r2 + r3], m2, 1 + lea r2, [r2 + 2 * r3] + pextrd [r2], m2, 2 + pextrd [r2 + r3], m2, 3 + RET %macro FILTER_VER_CHROMA_AVX2_4x4 1 INIT_YMM avx2 cglobal interp_4tap_vert_%1_4x4, 4, 6, 3 @@ -3148,8 +4881,8 @@ %endif RET %endmacro -FILTER_VER_CHROMA_AVX2_4x4 pp -FILTER_VER_CHROMA_AVX2_4x4 ps + FILTER_VER_CHROMA_AVX2_4x4 pp + FILTER_VER_CHROMA_AVX2_4x4 ps %macro FILTER_VER_CHROMA_AVX2_4x8 1 INIT_YMM avx2 @@ -3235,8 +4968,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_4x8 pp -FILTER_VER_CHROMA_AVX2_4x8 ps + FILTER_VER_CHROMA_AVX2_4x8 pp + FILTER_VER_CHROMA_AVX2_4x8 ps %macro FILTER_VER_CHROMA_AVX2_4x16 1 INIT_YMM avx2 @@ -3380,8 +5113,8 @@ %endif %endmacro -FILTER_VER_CHROMA_AVX2_4x16 pp -FILTER_VER_CHROMA_AVX2_4x16 ps + FILTER_VER_CHROMA_AVX2_4x16 pp + FILTER_VER_CHROMA_AVX2_4x16 ps ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -3390,184 +5123,184 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, 
[tab_ChromaCoeff + r4 * 4] %endif -pshufb m0, [tab_Cm] + pshufb m0, [tab_Cm] -mova m1, [pw_512] + mova m1, [pw_512] -mov r4d, %2 + mov r4d, %2 -lea r5, [3 * r1] + lea r5, [3 * r1] .loop: -movd m2, [r0] -movd m3, [r0 + r1] -movd m4, [r0 + 2 * r1] -movd m5, [r0 + r5] + movd m2, [r0] + movd m3, [r0 + r1] + movd m4, [r0 + 2 * r1] + movd m5, [r0 + r5] -punpcklbw m2, m3 -punpcklbw m6, m4, m5 -punpcklbw m2, m6 + punpcklbw m2, m3 + punpcklbw m6, m4, m5 + punpcklbw m2, m6 -pmaddubsw m2, m0 + pmaddubsw m2, m0 -lea r0, [r0 + 4 * r1] -movd m6, [r0] + lea r0, [r0 + 4 * r1] + movd m6, [r0] -punpcklbw m3, m4 -punpcklbw m7, m5, m6 -punpcklbw m3, m7 + punpcklbw m3, m4 + punpcklbw m7, m5, m6 + punpcklbw m3, m7 -pmaddubsw m3, m0 + pmaddubsw m3, m0 -phaddw m2, m3 + phaddw m2, m3 -pmulhrsw m2, m1 + pmulhrsw m2, m1 -movd m7, [r0 + r1] + movd m7, [r0 + r1] -punpcklbw m4, m5 -punpcklbw m3, m6, m7 -punpcklbw m4, m3 + punpcklbw m4, m5 + punpcklbw m3, m6, m7 + punpcklbw m4, m3 -pmaddubsw m4, m0 + pmaddubsw m4, m0 -movd m3, [r0 + 2 * r1] + movd m3, [r0 + 2 * r1] -punpcklbw m5, m6 -punpcklbw m7, m3 -punpcklbw m5, m7 + punpcklbw m5, m6 + punpcklbw m7, m3 + punpcklbw m5, m7 -pmaddubsw m5, m0 + pmaddubsw m5, m0 -phaddw m4, m5 + phaddw m4, m5 -pmulhrsw m4, m1 -packuswb m2, m4 -movd [r2], m2 -pextrd [r2 + r3], m2, 1 -lea r2, [r2 + 2 * r3] -pextrd [r2], m2, 2 -pextrd [r2 + r3], m2, 3 + pmulhrsw m4, m1 + packuswb m2, m4 + movd [r2], m2 + pextrd [r2 + r3], m2, 1 + lea r2, [r2 + 2 * r3] + pextrd [r2], m2, 2 + pextrd [r2 + r3], m2, 3 -lea r2, [r2 + 2 * r3] + lea r2, [r2 + 2 * r3] -sub r4, 4 -jnz .loop -RET + sub r4, 4 + jnz .loop + RET %endmacro -FILTER_V4_W4_H4 4, 8 -FILTER_V4_W4_H4 4, 16 + FILTER_V4_W4_H4 4, 8 + FILTER_V4_W4_H4 4, 16 -FILTER_V4_W4_H4 4, 32 + FILTER_V4_W4_H4 4, 32 %macro FILTER_V4_W8_H2 0 -punpcklbw m1, m2 -punpcklbw m7, m3, m0 + punpcklbw m1, m2 + punpcklbw m7, m3, m0 -pmaddubsw m1, m6 -pmaddubsw m7, m5 + pmaddubsw m1, m6 + pmaddubsw m7, m5 -paddw m1, m7 + paddw m1, m7 -pmulhrsw m1, m4 -packuswb m1, m1 + pmulhrsw m1, m4 + packuswb m1, m1 %endmacro %macro FILTER_V4_W8_H3 0 -punpcklbw m2, m3 -punpcklbw m7, m0, m1 + punpcklbw m2, m3 + punpcklbw m7, m0, m1 -pmaddubsw m2, m6 -pmaddubsw m7, m5 + pmaddubsw m2, m6 + pmaddubsw m7, m5 -paddw m2, m7 + paddw m2, m7 -pmulhrsw m2, m4 -packuswb m2, m2 + pmulhrsw m2, m4 + packuswb m2, m2 %endmacro %macro FILTER_V4_W8_H4 0 -punpcklbw m3, m0 -punpcklbw m7, m1, m2 + punpcklbw m3, m0 + punpcklbw m7, m1, m2 -pmaddubsw m3, m6 -pmaddubsw m7, m5 + pmaddubsw m3, m6 + pmaddubsw m7, m5 -paddw m3, m7 + paddw m3, m7 -pmulhrsw m3, m4 -packuswb m3, m3 + pmulhrsw m3, m4 + packuswb m3, m3 %endmacro %macro FILTER_V4_W8_H5 0 -punpcklbw m0, m1 -punpcklbw m7, m2, m3 + punpcklbw m0, m1 + punpcklbw m7, m2, m3 -pmaddubsw m0, m6 -pmaddubsw m7, m5 + pmaddubsw m0, m6 + pmaddubsw m7, m5 -paddw m0, m7 + paddw m0, m7 -pmulhrsw m0, m4 -packuswb m0, m0 + pmulhrsw m0, m4 + packuswb m0, m0 %endmacro %macro FILTER_V4_W8_8x2 2 -FILTER_V4_W8 %1, %2 -movq m0, [r0 + 4 * r1] + FILTER_V4_W8 %1, %2 + movq m0, [r0 + 4 * r1] -FILTER_V4_W8_H2 + FILTER_V4_W8_H2 -movh [r2 + r3], m1 + movh [r2 + r3], m1 %endmacro %macro FILTER_V4_W8_8x4 2 -FILTER_V4_W8_8x2 %1, %2 + FILTER_V4_W8_8x2 %1, %2 ;8x3 -lea r6, [r0 + 4 * r1] -movq m1, [r6 + r1] + lea r6, [r0 + 4 * r1] + movq m1, [r6 + r1] -FILTER_V4_W8_H3 + FILTER_V4_W8_H3 -movh [r2 + 2 * r3], m2 + movh [r2 + 2 * r3], m2 ;8x4 -movq m2, [r6 + 2 * r1] + movq m2, [r6 + 2 * r1] -FILTER_V4_W8_H4 + FILTER_V4_W8_H4 -lea r5, [r2 + 2 * r3] -movh [r5 + r3], m3 + lea r5, [r2 + 2 * r3] + movh 
[r5 + r3], m3 %endmacro %macro FILTER_V4_W8_8x6 2 -FILTER_V4_W8_8x4 %1, %2 + FILTER_V4_W8_8x4 %1, %2 ;8x5 -lea r6, [r6 + 2 * r1] -movq m3, [r6 + r1] + lea r6, [r6 + 2 * r1] + movq m3, [r6 + r1] -FILTER_V4_W8_H5 + FILTER_V4_W8_H5 -movh [r2 + 4 * r3], m0 + movh [r2 + 4 * r3], m0 ;8x6 -movq m0, [r0 + 8 * r1] + movq m0, [r0 + 8 * r1] -FILTER_V4_W8_H2 + FILTER_V4_W8_H2 -lea r5, [r2 + 4 * r3] -movh [r5 + r3], m1 + lea r5, [r2 + 4 * r3] + movh [r5 + r3], m1 %endmacro ;----------------------------------------------------------------------------- @@ -3577,60 +5310,60 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8 -mov r4d, r4m + mov r4d, r4m -sub r0, r1 -movq m0, [r0] -movq m1, [r0 + r1] -movq m2, [r0 + 2 * r1] -lea r5, [r0 + 2 * r1] -movq m3, [r5 + r1] + sub r0, r1 + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + lea r5, [r0 + 2 * r1] + movq m3, [r5 + r1] -punpcklbw m0, m1 -punpcklbw m4, m2, m3 + punpcklbw m0, m1 + punpcklbw m4, m2, m3 %ifdef PIC -lea r6, [tab_ChromaCoeff] -movd m5, [r6 + r4 * 4] + lea r6, [tab_ChromaCoeff] + movd m5, [r6 + r4 * 4] %else -movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m6, m5, [tab_Vm] -pmaddubsw m0, m6 + pshufb m6, m5, [tab_Vm] + pmaddubsw m0, m6 -pshufb m5, [tab_Vm + 16] -pmaddubsw m4, m5 + pshufb m5, [tab_Vm + 16] + pmaddubsw m4, m5 -paddw m0, m4 + paddw m0, m4 -mova m4, [pw_512] + mova m4, [pw_512] -pmulhrsw m0, m4 -packuswb m0, m0 -movh [r2], m0 + pmulhrsw m0, m4 + packuswb m0, m0 + movh [r2], m0 %endmacro ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_8x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- -FILTER_V4_W8_8x2 8, 2 + FILTER_V4_W8_8x2 8, 2 -RET + RET ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- -FILTER_V4_W8_8x4 8, 4 + FILTER_V4_W8_8x4 8, 4 -RET + RET ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_8x6(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- -FILTER_V4_W8_8x6 8, 6 + FILTER_V4_W8_8x6 8, 6 -RET + RET ;------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_ps_4x2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -3638,46 +5371,46 @@ INIT_XMM sse4 cglobal interp_4tap_vert_ps_4x2, 4, 6, 6 -mov r4d, r4m -sub r0, r1 -add r3d, r3d + mov r4d, r4m + sub r0, r1 + add r3d, r3d %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m0, [tab_Cm] + pshufb m0, [tab_Cm] -movd m2, [r0] -movd m3, [r0 + r1] -lea r5, [r0 + 2 * r1] -movd m4, [r5] -movd m5, [r5 + r1] + movd m2, [r0] + movd m3, [r0 + r1] + lea r5, [r0 + 2 * r1] + movd m4, [r5] + movd m5, [r5 + r1] -punpcklbw m2, m3 -punpcklbw m1, m4, m5 -punpcklbw m2, m1 + punpcklbw m2, m3 + punpcklbw m1, m4, m5 + punpcklbw m2, m1 -pmaddubsw m2, m0 + pmaddubsw m2, m0 -movd m1, [r0 + 4 * r1] + movd m1, [r0 + 4 * r1] -punpcklbw m3, m4 
-punpcklbw m5, m1 -punpcklbw m3, m5 + punpcklbw m3, m4 + punpcklbw m5, m1 + punpcklbw m3, m5 -pmaddubsw m3, m0 + pmaddubsw m3, m0 -phaddw m2, m3 + phaddw m2, m3 -psubw m2, [pw_2000] -movh [r2], m2 -movhps [r2 + r3], m2 + psubw m2, [pw_2000] + movh [r2], m2 + movhps [r2 + r3], m2 -RET + RET ;------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -3835,10 +5568,10 @@ RET %endmacro -FILTER_V_PS_W4_H4 4, 8 -FILTER_V_PS_W4_H4 4, 16 + FILTER_V_PS_W4_H4 4, 8 + FILTER_V_PS_W4_H4 4, 16 -FILTER_V_PS_W4_H4 4, 32 + FILTER_V_PS_W4_H4 4, 32 ;-------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -3904,12 +5637,12 @@ RET %endmacro -FILTER_V_PS_W8_H8_H16_H2 8, 2 -FILTER_V_PS_W8_H8_H16_H2 8, 4 -FILTER_V_PS_W8_H8_H16_H2 8, 6 + FILTER_V_PS_W8_H8_H16_H2 8, 2 + FILTER_V_PS_W8_H8_H16_H2 8, 4 + FILTER_V_PS_W8_H8_H16_H2 8, 6 -FILTER_V_PS_W8_H8_H16_H2 8, 12 -FILTER_V_PS_W8_H8_H16_H2 8, 64 + FILTER_V_PS_W8_H8_H16_H2 8, 12 + FILTER_V_PS_W8_H8_H16_H2 8, 64 ;-------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -3999,9 +5732,9 @@ RET %endmacro -FILTER_V_PS_W8_H8_H16_H32 8, 8 -FILTER_V_PS_W8_H8_H16_H32 8, 16 -FILTER_V_PS_W8_H8_H16_H32 8, 32 + FILTER_V_PS_W8_H8_H16_H32 8, 8 + FILTER_V_PS_W8_H8_H16_H32 8, 16 + FILTER_V_PS_W8_H8_H16_H32 8, 32 ;------------------------------------------------------------------------------------------------------------ ;void interp_4tap_vert_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -4095,8 +5828,8 @@ RET %endmacro -FILTER_V_PS_W6 6, 8 -FILTER_V_PS_W6 6, 16 + FILTER_V_PS_W6 6, 8 + FILTER_V_PS_W6 6, 16 ;--------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_ps_12x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -4181,8 +5914,8 @@ RET %endmacro -FILTER_V_PS_W12 12, 16 -FILTER_V_PS_W12 12, 32 + FILTER_V_PS_W12 12, 16 + FILTER_V_PS_W12 12, 32 ;--------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_ps_16x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -4266,14 +5999,14 @@ RET %endmacro -FILTER_V_PS_W16 16, 4 -FILTER_V_PS_W16 16, 8 -FILTER_V_PS_W16 16, 12 -FILTER_V_PS_W16 16, 16 -FILTER_V_PS_W16 16, 32 + FILTER_V_PS_W16 16, 4 + FILTER_V_PS_W16 16, 8 + FILTER_V_PS_W16 16, 12 + FILTER_V_PS_W16 16, 16 + FILTER_V_PS_W16 16, 32 -FILTER_V_PS_W16 16, 24 -FILTER_V_PS_W16 16, 64 + FILTER_V_PS_W16 16, 24 + FILTER_V_PS_W16 16, 64 ;-------------------------------------------------------------------------------------------------------------- ;void interp_4tap_vert_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -4389,9 +6122,9 @@ RET %endmacro -FILTER_V4_PS_W24 24, 32 + FILTER_V4_PS_W24 24, 32 -FILTER_V4_PS_W24 24, 64 + FILTER_V4_PS_W24 24, 64 ;--------------------------------------------------------------------------------------------------------------- ; void 
interp_4tap_vert_ps_32x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -4482,13 +6215,13 @@ RET %endmacro -FILTER_V_PS_W32 32, 8 -FILTER_V_PS_W32 32, 16 -FILTER_V_PS_W32 32, 24 -FILTER_V_PS_W32 32, 32 + FILTER_V_PS_W32 32, 8 + FILTER_V_PS_W32 32, 16 + FILTER_V_PS_W32 32, 24 + FILTER_V_PS_W32 32, 32 -FILTER_V_PS_W32 32, 48 -FILTER_V_PS_W32 32, 64 + FILTER_V_PS_W32 32, 48 + FILTER_V_PS_W32 32, 64 ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -4497,95 +6230,95 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m5, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m5, [r5 + r4 * 4] %else -movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m6, m5, [tab_Vm] -pshufb m5, [tab_Vm + 16] -mova m4, [pw_512] -lea r5, [r1 * 3] + pshufb m6, m5, [tab_Vm] + pshufb m5, [tab_Vm + 16] + mova m4, [pw_512] + lea r5, [r1 * 3] -mov r4d, %2 + mov r4d, %2 .loop: -movq m0, [r0] -movq m1, [r0 + r1] -movq m2, [r0 + 2 * r1] -movq m3, [r0 + r5] + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] -punpcklbw m0, m1 -punpcklbw m1, m2 -punpcklbw m2, m3 + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 -pmaddubsw m0, m6 -pmaddubsw m7, m2, m5 + pmaddubsw m0, m6 + pmaddubsw m7, m2, m5 -paddw m0, m7 + paddw m0, m7 -pmulhrsw m0, m4 -packuswb m0, m0 -movh [r2], m0 + pmulhrsw m0, m4 + packuswb m0, m0 + movh [r2], m0 -lea r0, [r0 + 4 * r1] -movq m0, [r0] + lea r0, [r0 + 4 * r1] + movq m0, [r0] -punpcklbw m3, m0 + punpcklbw m3, m0 -pmaddubsw m1, m6 -pmaddubsw m7, m3, m5 + pmaddubsw m1, m6 + pmaddubsw m7, m3, m5 -paddw m1, m7 + paddw m1, m7 -pmulhrsw m1, m4 -packuswb m1, m1 -movh [r2 + r3], m1 + pmulhrsw m1, m4 + packuswb m1, m1 + movh [r2 + r3], m1 -movq m1, [r0 + r1] + movq m1, [r0 + r1] -punpcklbw m0, m1 + punpcklbw m0, m1 -pmaddubsw m2, m6 -pmaddubsw m0, m5 + pmaddubsw m2, m6 + pmaddubsw m0, m5 -paddw m2, m0 + paddw m2, m0 -pmulhrsw m2, m4 + pmulhrsw m2, m4 -movq m7, [r0 + 2 * r1] -punpcklbw m1, m7 + movq m7, [r0 + 2 * r1] + punpcklbw m1, m7 -pmaddubsw m3, m6 -pmaddubsw m1, m5 + pmaddubsw m3, m6 + pmaddubsw m1, m5 -paddw m3, m1 + paddw m3, m1 -pmulhrsw m3, m4 -packuswb m2, m3 + pmulhrsw m3, m4 + packuswb m2, m3 -lea r2, [r2 + 2 * r3] -movh [r2], m2 -movhps [r2 + r3], m2 + lea r2, [r2 + 2 * r3] + movh [r2], m2 + movhps [r2 + r3], m2 -lea r2, [r2 + 2 * r3] + lea r2, [r2 + 2 * r3] -sub r4, 4 -jnz .loop -RET + sub r4, 4 + jnz .loop + RET %endmacro -FILTER_V4_W8_H8_H16_H32 8, 8 -FILTER_V4_W8_H8_H16_H32 8, 16 -FILTER_V4_W8_H8_H16_H32 8, 32 + FILTER_V4_W8_H8_H16_H32 8, 8 + FILTER_V4_W8_H8_H16_H32 8, 16 + FILTER_V4_W8_H8_H16_H32 8, 32 -FILTER_V4_W8_H8_H16_H32 8, 12 -FILTER_V4_W8_H8_H16_H32 8, 64 + FILTER_V4_W8_H8_H16_H32 8, 12 + FILTER_V4_W8_H8_H16_H32 8, 64 %macro PROCESS_CHROMA_AVX2_W8_8R 0 movq xm1, [r0] ; m1 = row 0 @@ -4691,8 +6424,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_8x8 pp -FILTER_VER_CHROMA_AVX2_8x8 ps + FILTER_VER_CHROMA_AVX2_8x8 pp + FILTER_VER_CHROMA_AVX2_8x8 ps %macro FILTER_VER_CHROMA_AVX2_8x6 1 INIT_YMM avx2 @@ -4780,8 +6513,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_8x6 pp -FILTER_VER_CHROMA_AVX2_8x6 ps + FILTER_VER_CHROMA_AVX2_8x6 pp + FILTER_VER_CHROMA_AVX2_8x6 ps %macro PROCESS_CHROMA_AVX2_W8_16R 1 movq xm1, [r0] ; m1 = row 0 @@ -4961,12 +6694,154 @@ RET 
%endmacro -FILTER_VER_CHROMA_AVX2_8x16 pp -FILTER_VER_CHROMA_AVX2_8x16 ps + FILTER_VER_CHROMA_AVX2_8x16 pp + FILTER_VER_CHROMA_AVX2_8x16 ps + +%macro FILTER_VER_CHROMA_AVX2_8x12 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x12, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1, pp + mova m7, [pw_512] +%else + add r3d, r3d + mova m7, [pw_2000] +%endif + lea r6, [r3 * 3] + movq xm1, [r0] ; m1 = row 0 + movq xm2, [r0 + r1] ; m2 = row 1 + punpcklbw xm1, xm2 + movq xm3, [r0 + r1 * 2] ; m3 = row 2 + punpcklbw xm2, xm3 + vinserti128 m5, m1, xm2, 1 + pmaddubsw m5, [r5] + movq xm4, [r0 + r4] ; m4 = row 3 + punpcklbw xm3, xm4 + lea r0, [r0 + r1 * 4] + movq xm1, [r0] ; m1 = row 4 + punpcklbw xm4, xm1 + vinserti128 m2, m3, xm4, 1 + pmaddubsw m0, m2, [r5 + 1 * mmsize] + paddw m5, m0 + pmaddubsw m2, [r5] + movq xm3, [r0 + r1] ; m3 = row 5 + punpcklbw xm1, xm3 + movq xm4, [r0 + r1 * 2] ; m4 = row 6 + punpcklbw xm3, xm4 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m0, m1, [r5 + 1 * mmsize] + paddw m2, m0 + pmaddubsw m1, [r5] + movq xm3, [r0 + r4] ; m3 = row 7 + punpcklbw xm4, xm3 + lea r0, [r0 + r1 * 4] + movq xm0, [r0] ; m0 = row 8 + punpcklbw xm3, xm0 + vinserti128 m4, m4, xm3, 1 + pmaddubsw m3, m4, [r5 + 1 * mmsize] + paddw m1, m3 + pmaddubsw m4, [r5] + movq xm3, [r0 + r1] ; m3 = row 9 + punpcklbw xm0, xm3 + movq xm6, [r0 + r1 * 2] ; m6 = row 10 + punpcklbw xm3, xm6 + vinserti128 m0, m0, xm3, 1 + pmaddubsw m3, m0, [r5 + 1 * mmsize] + paddw m4, m3 + pmaddubsw m0, [r5] +%ifidn %1, pp + pmulhrsw m5, m7 ; m5 = word: row 0, row 1 + pmulhrsw m2, m7 ; m2 = word: row 2, row 3 + pmulhrsw m1, m7 ; m1 = word: row 4, row 5 + pmulhrsw m4, m7 ; m4 = word: row 6, row 7 + packuswb m5, m2 + packuswb m1, m4 + vextracti128 xm2, m5, 1 + vextracti128 xm4, m1, 1 + movq [r2], xm5 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm2 + lea r2, [r2 + r3 * 4] + movq [r2], xm1 + movq [r2 + r3], xm4 + movhps [r2 + r3 * 2], xm1 + movhps [r2 + r6], xm4 +%else + psubw m5, m7 ; m5 = word: row 0, row 1 + psubw m2, m7 ; m2 = word: row 2, row 3 + psubw m1, m7 ; m1 = word: row 4, row 5 + psubw m4, m7 ; m4 = word: row 6, row 7 + vextracti128 xm3, m5, 1 + movu [r2], xm5 + movu [r2 + r3], xm3 + vextracti128 xm3, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 + lea r2, [r2 + r3 * 4] + vextracti128 xm5, m1, 1 + vextracti128 xm3, m4, 1 + movu [r2], xm1 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm4 + movu [r2 + r6], xm3 +%endif + movq xm3, [r0 + r4] ; m3 = row 11 + punpcklbw xm6, xm3 + lea r0, [r0 + r1 * 4] + movq xm5, [r0] ; m5 = row 12 + punpcklbw xm3, xm5 + vinserti128 m6, m6, xm3, 1 + pmaddubsw m3, m6, [r5 + 1 * mmsize] + paddw m0, m3 + pmaddubsw m6, [r5] + movq xm3, [r0 + r1] ; m3 = row 13 + punpcklbw xm5, xm3 + movq xm2, [r0 + r1 * 2] ; m2 = row 14 + punpcklbw xm3, xm2 + vinserti128 m5, m5, xm3, 1 + pmaddubsw m3, m5, [r5 + 1 * mmsize] + paddw m6, m3 + lea r2, [r2 + r3 * 4] +%ifidn %1, pp + pmulhrsw m0, m7 ; m0 = word: row 8, row 9 + pmulhrsw m6, m7 ; m6 = word: row 10, row 11 + packuswb m0, m6 + vextracti128 xm6, m0, 1 + movq [r2], xm0 + movq [r2 + r3], xm6 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm6 +%else + psubw m0, m7 ; m0 = word: row 8, row 9 + psubw m6, m7 ; m6 = word: row 10, row 11 + vextracti128 xm1, m0, 1 + vextracti128 xm3, m6, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm3 +%endif + RET +%endmacro + + 
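+; As elsewhere in this file, %1 selects the output type of the 8x12
+; kernel above: pp rounds with pmulhrsw/pw_512 (in effect
+; (sum + 32) >> 6) and packs to pixels with packuswb, while ps keeps the
+; unrounded sum, subtracts pw_2000 (0x2000 = 8192, matching x265's
+; IF_INTERNAL_OFFS centering) and stores int16_t rows, which is why the
+; ps prologue doubles the destination stride (add r3d, r3d).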
FILTER_VER_CHROMA_AVX2_8x12 pp + FILTER_VER_CHROMA_AVX2_8x12 ps -%macro FILTER_VER_CHROMA_AVX2_8x32 1 +%macro FILTER_VER_CHROMA_AVX2_8xN 2 INIT_YMM avx2 -cglobal interp_4tap_vert_%1_8x32, 4, 7, 8 +cglobal interp_4tap_vert_%1_8x%2, 4, 7, 8 mov r4d, r4m shl r4d, 6 @@ -4986,15 +6861,17 @@ mova m7, [pw_2000] %endif lea r6, [r3 * 3] -%rep 2 +%rep %2 / 16 PROCESS_CHROMA_AVX2_W8_16R %1 lea r2, [r2 + r3 * 4] %endrep RET %endmacro -FILTER_VER_CHROMA_AVX2_8x32 pp -FILTER_VER_CHROMA_AVX2_8x32 ps + FILTER_VER_CHROMA_AVX2_8xN pp, 32 + FILTER_VER_CHROMA_AVX2_8xN ps, 32 + FILTER_VER_CHROMA_AVX2_8xN pp, 64 + FILTER_VER_CHROMA_AVX2_8xN ps, 64 %macro PROCESS_CHROMA_AVX2_W8_4R 0 movq xm1, [r0] ; m1 = row 0 @@ -5065,8 +6942,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_8x4 pp -FILTER_VER_CHROMA_AVX2_8x4 ps + FILTER_VER_CHROMA_AVX2_8x4 pp + FILTER_VER_CHROMA_AVX2_8x4 ps %macro FILTER_VER_CHROMA_AVX2_8x2 1 INIT_YMM avx2 @@ -5114,8 +6991,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_8x2 pp -FILTER_VER_CHROMA_AVX2_8x2 ps + FILTER_VER_CHROMA_AVX2_8x2 pp + FILTER_VER_CHROMA_AVX2_8x2 ps %macro FILTER_VER_CHROMA_AVX2_6x8 1 INIT_YMM avx2 @@ -5194,8 +7071,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_6x8 pp -FILTER_VER_CHROMA_AVX2_6x8 ps + FILTER_VER_CHROMA_AVX2_6x8 pp + FILTER_VER_CHROMA_AVX2_6x8 ps ;----------------------------------------------------------------------------- ;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -5204,96 +7081,96 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_6x%2, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m5, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m5, [r5 + r4 * 4] %else -movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m6, m5, [tab_Vm] -pshufb m5, [tab_Vm + 16] -mova m4, [pw_512] + pshufb m6, m5, [tab_Vm] + pshufb m5, [tab_Vm + 16] + mova m4, [pw_512] -mov r4d, %2 -lea r5, [3 * r1] + mov r4d, %2 + lea r5, [3 * r1] .loop: -movq m0, [r0] -movq m1, [r0 + r1] -movq m2, [r0 + 2 * r1] -movq m3, [r0 + r5] + movq m0, [r0] + movq m1, [r0 + r1] + movq m2, [r0 + 2 * r1] + movq m3, [r0 + r5] -punpcklbw m0, m1 -punpcklbw m1, m2 -punpcklbw m2, m3 + punpcklbw m0, m1 + punpcklbw m1, m2 + punpcklbw m2, m3 -pmaddubsw m0, m6 -pmaddubsw m7, m2, m5 + pmaddubsw m0, m6 + pmaddubsw m7, m2, m5 -paddw m0, m7 + paddw m0, m7 -pmulhrsw m0, m4 -packuswb m0, m0 -movd [r2], m0 -pextrw [r2 + 4], m0, 2 + pmulhrsw m0, m4 + packuswb m0, m0 + movd [r2], m0 + pextrw [r2 + 4], m0, 2 -lea r0, [r0 + 4 * r1] + lea r0, [r0 + 4 * r1] -movq m0, [r0] -punpcklbw m3, m0 + movq m0, [r0] + punpcklbw m3, m0 -pmaddubsw m1, m6 -pmaddubsw m7, m3, m5 + pmaddubsw m1, m6 + pmaddubsw m7, m3, m5 -paddw m1, m7 + paddw m1, m7 -pmulhrsw m1, m4 -packuswb m1, m1 -movd [r2 + r3], m1 -pextrw [r2 + r3 + 4], m1, 2 + pmulhrsw m1, m4 + packuswb m1, m1 + movd [r2 + r3], m1 + pextrw [r2 + r3 + 4], m1, 2 -movq m1, [r0 + r1] -punpcklbw m7, m0, m1 + movq m1, [r0 + r1] + punpcklbw m7, m0, m1 -pmaddubsw m2, m6 -pmaddubsw m7, m5 + pmaddubsw m2, m6 + pmaddubsw m7, m5 -paddw m2, m7 + paddw m2, m7 -pmulhrsw m2, m4 -packuswb m2, m2 -lea r2, [r2 + 2 * r3] -movd [r2], m2 -pextrw [r2 + 4], m2, 2 + pmulhrsw m2, m4 + packuswb m2, m2 + lea r2, [r2 + 2 * r3] + movd [r2], m2 + pextrw [r2 + 4], m2, 2 -movq m2, [r0 + 2 * r1] -punpcklbw m1, m2 + movq m2, [r0 + 2 * r1] + punpcklbw m1, m2 -pmaddubsw m3, m6 -pmaddubsw m1, m5 + pmaddubsw m3, m6 + pmaddubsw m1, m5 -paddw m3, m1 + paddw m3, m1 -pmulhrsw m3, m4 -packuswb 
m3, m3 + pmulhrsw m3, m4 + packuswb m3, m3 -movd [r2 + r3], m3 -pextrw [r2 + r3 + 4], m3, 2 + movd [r2 + r3], m3 + pextrw [r2 + r3 + 4], m3, 2 -lea r2, [r2 + 2 * r3] + lea r2, [r2 + 2 * r3] -sub r4, 4 -jnz .loop -RET + sub r4, 4 + jnz .loop + RET %endmacro -FILTER_V4_W6_H4 6, 8 + FILTER_V4_W6_H4 6, 8 -FILTER_V4_W6_H4 6, 16 + FILTER_V4_W6_H4 6, 16 ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -5302,88 +7179,88 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_12x%2, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m1, m0, [tab_Vm] -pshufb m0, [tab_Vm + 16] + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] -mov r4d, %2 + mov r4d, %2 .loop: -movu m2, [r0] -movu m3, [r0 + r1] + movu m2, [r0] + movu m3, [r0 + r1] -punpcklbw m4, m2, m3 -punpckhbw m2, m3 + punpcklbw m4, m2, m3 + punpckhbw m2, m3 -pmaddubsw m4, m1 -pmaddubsw m2, m1 + pmaddubsw m4, m1 + pmaddubsw m2, m1 -lea r0, [r0 + 2 * r1] -movu m5, [r0] -movu m7, [r0 + r1] + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m7, [r0 + r1] -punpcklbw m6, m5, m7 -pmaddubsw m6, m0 -paddw m4, m6 + punpcklbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m4, m6 -punpckhbw m6, m5, m7 -pmaddubsw m6, m0 -paddw m2, m6 + punpckhbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m2, m6 -mova m6, [pw_512] + mova m6, [pw_512] -pmulhrsw m4, m6 -pmulhrsw m2, m6 + pmulhrsw m4, m6 + pmulhrsw m2, m6 -packuswb m4, m2 + packuswb m4, m2 -movh [r2], m4 -pextrd [r2 + 8], m4, 2 + movh [r2], m4 + pextrd [r2 + 8], m4, 2 -punpcklbw m4, m3, m5 -punpckhbw m3, m5 + punpcklbw m4, m3, m5 + punpckhbw m3, m5 -pmaddubsw m4, m1 -pmaddubsw m3, m1 + pmaddubsw m4, m1 + pmaddubsw m3, m1 -movu m5, [r0 + 2 * r1] + movu m5, [r0 + 2 * r1] -punpcklbw m2, m7, m5 -punpckhbw m7, m5 + punpcklbw m2, m7, m5 + punpckhbw m7, m5 -pmaddubsw m2, m0 -pmaddubsw m7, m0 + pmaddubsw m2, m0 + pmaddubsw m7, m0 -paddw m4, m2 -paddw m3, m7 + paddw m4, m2 + paddw m3, m7 -pmulhrsw m4, m6 -pmulhrsw m3, m6 + pmulhrsw m4, m6 + pmulhrsw m3, m6 -packuswb m4, m3 + packuswb m4, m3 -movh [r2 + r3], m4 -pextrd [r2 + r3 + 8], m4, 2 + movh [r2 + r3], m4 + pextrd [r2 + r3 + 8], m4, 2 -lea r2, [r2 + 2 * r3] + lea r2, [r2 + 2 * r3] -sub r4, 2 -jnz .loop -RET + sub r4, 2 + jnz .loop + RET %endmacro -FILTER_V4_W12_H2 12, 16 + FILTER_V4_W12_H2 12, 16 -FILTER_V4_W12_H2 12, 32 + FILTER_V4_W12_H2 12, 32 ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -5392,91 +7269,91 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_16x%2, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m1, m0, [tab_Vm] -pshufb m0, [tab_Vm + 16] + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] -mov r4d, %2/2 + mov r4d, %2/2 .loop: -movu m2, [r0] -movu m3, [r0 + r1] + movu m2, [r0] + movu m3, [r0 + r1] -punpcklbw m4, m2, m3 -punpckhbw m2, m3 + punpcklbw m4, m2, m3 + punpckhbw m2, m3 -pmaddubsw m4, m1 -pmaddubsw m2, m1 + pmaddubsw m4, m1 + pmaddubsw m2, m1 -lea r0, 
[r0 + 2 * r1] -movu m5, [r0] -movu m6, [r0 + r1] + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m6, [r0 + r1] -punpckhbw m7, m5, m6 -pmaddubsw m7, m0 -paddw m2, m7 + punpckhbw m7, m5, m6 + pmaddubsw m7, m0 + paddw m2, m7 -punpcklbw m7, m5, m6 -pmaddubsw m7, m0 -paddw m4, m7 + punpcklbw m7, m5, m6 + pmaddubsw m7, m0 + paddw m4, m7 -mova m7, [pw_512] + mova m7, [pw_512] -pmulhrsw m4, m7 -pmulhrsw m2, m7 + pmulhrsw m4, m7 + pmulhrsw m2, m7 -packuswb m4, m2 + packuswb m4, m2 -movu [r2], m4 + movu [r2], m4 -punpcklbw m4, m3, m5 -punpckhbw m3, m5 + punpcklbw m4, m3, m5 + punpckhbw m3, m5 -pmaddubsw m4, m1 -pmaddubsw m3, m1 + pmaddubsw m4, m1 + pmaddubsw m3, m1 -movu m5, [r0 + 2 * r1] + movu m5, [r0 + 2 * r1] -punpcklbw m2, m6, m5 -punpckhbw m6, m5 + punpcklbw m2, m6, m5 + punpckhbw m6, m5 -pmaddubsw m2, m0 -pmaddubsw m6, m0 + pmaddubsw m2, m0 + pmaddubsw m6, m0 -paddw m4, m2 -paddw m3, m6 + paddw m4, m2 + paddw m3, m6 -pmulhrsw m4, m7 -pmulhrsw m3, m7 + pmulhrsw m4, m7 + pmulhrsw m3, m7 -packuswb m4, m3 + packuswb m4, m3 -movu [r2 + r3], m4 + movu [r2 + r3], m4 -lea r2, [r2 + 2 * r3] + lea r2, [r2 + 2 * r3] -dec r4d -jnz .loop -RET + dec r4d + jnz .loop + RET %endmacro -FILTER_V4_W16_H2 16, 4 -FILTER_V4_W16_H2 16, 8 -FILTER_V4_W16_H2 16, 12 -FILTER_V4_W16_H2 16, 16 -FILTER_V4_W16_H2 16, 32 + FILTER_V4_W16_H2 16, 4 + FILTER_V4_W16_H2 16, 8 + FILTER_V4_W16_H2 16, 12 + FILTER_V4_W16_H2 16, 16 + FILTER_V4_W16_H2 16, 32 -FILTER_V4_W16_H2 16, 24 -FILTER_V4_W16_H2 16, 64 + FILTER_V4_W16_H2 16, 24 + FILTER_V4_W16_H2 16, 64 %macro FILTER_VER_CHROMA_AVX2_16x16 1 INIT_YMM avx2 @@ -5736,8 +7613,8 @@ %endif %endmacro -FILTER_VER_CHROMA_AVX2_16x16 pp -FILTER_VER_CHROMA_AVX2_16x16 ps + FILTER_VER_CHROMA_AVX2_16x16 pp + FILTER_VER_CHROMA_AVX2_16x16 ps %macro FILTER_VER_CHROMA_AVX2_16x8 1 INIT_YMM avx2 cglobal interp_4tap_vert_%1_16x8, 4, 7, 7 @@ -5891,8 +7768,8 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_16x8 pp -FILTER_VER_CHROMA_AVX2_16x8 ps + FILTER_VER_CHROMA_AVX2_16x8 pp + FILTER_VER_CHROMA_AVX2_16x8 ps %macro FILTER_VER_CHROMA_AVX2_16x12 1 INIT_YMM avx2 @@ -6119,13 +7996,13 @@ %endif %endmacro -FILTER_VER_CHROMA_AVX2_16x12 pp -FILTER_VER_CHROMA_AVX2_16x12 ps + FILTER_VER_CHROMA_AVX2_16x12 pp + FILTER_VER_CHROMA_AVX2_16x12 ps -%macro FILTER_VER_CHROMA_AVX2_16x32 1 -INIT_YMM avx2 +%macro FILTER_VER_CHROMA_AVX2_16xN 2 %if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_16x32, 4, 8, 8 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_16x%2, 4, 8, 8 mov r4d, r4m shl r4d, 6 @@ -6145,7 +8022,7 @@ mova m7, [pw_2000] %endif lea r6, [r3 * 3] - mov r7d, 2 + mov r7d, %2 / 16 .loopH: movu xm0, [r0] vinserti128 m0, m0, [r0 + r1 * 2], 1 @@ -6412,8 +8289,381 @@ %endif %endmacro -FILTER_VER_CHROMA_AVX2_16x32 pp -FILTER_VER_CHROMA_AVX2_16x32 ps + FILTER_VER_CHROMA_AVX2_16xN pp, 32 + FILTER_VER_CHROMA_AVX2_16xN ps, 32 + FILTER_VER_CHROMA_AVX2_16xN pp, 64 + FILTER_VER_CHROMA_AVX2_16xN ps, 64 + +%macro FILTER_VER_CHROMA_AVX2_16x24 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_16x24, 4, 6, 15 + mov r4d, r4m + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer_32 + r4] +%endif + + mova m12, [r5] + mova m13, [r5 + mmsize] + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + mova m14, [pw_512] +%else + add r3d, r3d + vbroadcasti128 m14, [pw_2000] +%endif + lea r5, [r3 * 3] + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m0, m12 + movu xm2, [r0 + r1 * 2] ; m2 = 
row 2 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m1, m12 + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m4, m2, m13 + paddw m0, m4 + pmaddubsw m2, m12 + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m5, m3, m13 + paddw m1, m5 + pmaddubsw m3, m12 + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhbw xm6, xm4, xm5 + punpcklbw xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddubsw m6, m4, m13 + paddw m2, m6 + pmaddubsw m4, m12 + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhbw xm7, xm5, xm6 + punpcklbw xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddubsw m7, m5, m13 + paddw m3, m7 + pmaddubsw m5, m12 + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhbw xm8, xm6, xm7 + punpcklbw xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddubsw m8, m6, m13 + paddw m4, m8 + pmaddubsw m6, m12 + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhbw xm9, xm7, xm8 + punpcklbw xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddubsw m9, m7, m13 + paddw m5, m9 + pmaddubsw m7, m12 + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhbw xm10, xm8, xm9 + punpcklbw xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddubsw m10, m8, m13 + paddw m6, m10 + pmaddubsw m8, m12 + movu xm10, [r0 + r1 * 2] ; m10 = row 10 + punpckhbw xm11, xm9, xm10 + punpcklbw xm9, xm10 + vinserti128 m9, m9, xm11, 1 + pmaddubsw m11, m9, m13 + paddw m7, m11 + pmaddubsw m9, m12 + +%ifidn %1,pp + pmulhrsw m0, m14 ; m0 = word: row 0 + pmulhrsw m1, m14 ; m1 = word: row 1 + pmulhrsw m2, m14 ; m2 = word: row 2 + pmulhrsw m3, m14 ; m3 = word: row 3 + pmulhrsw m4, m14 ; m4 = word: row 4 + pmulhrsw m5, m14 ; m5 = word: row 5 + pmulhrsw m6, m14 ; m6 = word: row 6 + pmulhrsw m7, m14 ; m7 = word: row 7 + packuswb m0, m1 + packuswb m2, m3 + packuswb m4, m5 + packuswb m6, m7 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + vpermq m4, m4, q3120 + vpermq m6, m6, q3120 + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r5], xm3 + lea r2, [r2 + r3 * 4] + movu [r2], xm4 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm6 + movu [r2 + r5], xm7 +%else + psubw m0, m14 ; m0 = word: row 0 + psubw m1, m14 ; m1 = word: row 1 + psubw m2, m14 ; m2 = word: row 2 + psubw m3, m14 ; m3 = word: row 3 + psubw m4, m14 ; m4 = word: row 4 + psubw m5, m14 ; m5 = word: row 5 + psubw m6, m14 ; m6 = word: row 6 + psubw m7, m14 ; m7 = word: row 7 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r5], m3 + lea r2, [r2 + r3 * 4] + movu [r2], m4 + movu [r2 + r3], m5 + movu [r2 + r3 * 2], m6 + movu [r2 + r5], m7 +%endif + lea r2, [r2 + r3 * 4] + + movu xm11, [r0 + r4] ; m11 = row 11 + punpckhbw xm6, xm10, xm11 + punpcklbw xm10, xm11 + vinserti128 m10, m10, xm6, 1 + pmaddubsw m6, m10, m13 + paddw m8, m6 + pmaddubsw m10, m12 + lea r0, [r0 + r1 * 4] + movu xm6, [r0] ; m6 = row 12 + punpckhbw xm7, xm11, xm6 + punpcklbw xm11, xm6 + vinserti128 m11, m11, xm7, 1 + pmaddubsw m7, m11, m13 + paddw m9, m7 + pmaddubsw m11, m12 + + movu xm7, [r0 + r1] ; m7 = row 13 + punpckhbw xm0, xm6, xm7 + punpcklbw xm6, xm7 + vinserti128 m6, m6, xm0, 1 + pmaddubsw m0, m6, m13 + paddw m10, m0 + pmaddubsw m6, m12 + movu xm0, [r0 + r1 * 2] ; m0 = row 14 + punpckhbw xm1, xm7, xm0 + punpcklbw xm7, xm0 + vinserti128 m7, m7, xm1, 1 + pmaddubsw m1, m7, m13 + paddw m11, m1 + pmaddubsw m7, m12 + movu 
xm1, [r0 + r4] ; m1 = row 15 + punpckhbw xm2, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddubsw m2, m0, m13 + paddw m6, m2 + pmaddubsw m0, m12 + lea r0, [r0 + r1 * 4] + movu xm2, [r0] ; m2 = row 16 + punpckhbw xm3, xm1, xm2 + punpcklbw xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddubsw m3, m1, m13 + paddw m7, m3 + pmaddubsw m1, m12 + movu xm3, [r0 + r1] ; m3 = row 17 + punpckhbw xm4, xm2, xm3 + punpcklbw xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddubsw m4, m2, m13 + paddw m0, m4 + pmaddubsw m2, m12 + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhbw xm5, xm3, xm4 + punpcklbw xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddubsw m5, m3, m13 + paddw m1, m5 + pmaddubsw m3, m12 + +%ifidn %1,pp + pmulhrsw m8, m14 ; m8 = word: row 8 + pmulhrsw m9, m14 ; m9 = word: row 9 + pmulhrsw m10, m14 ; m10 = word: row 10 + pmulhrsw m11, m14 ; m11 = word: row 11 + pmulhrsw m6, m14 ; m6 = word: row 12 + pmulhrsw m7, m14 ; m7 = word: row 13 + pmulhrsw m0, m14 ; m0 = word: row 14 + pmulhrsw m1, m14 ; m1 = word: row 15 + packuswb m8, m9 + packuswb m10, m11 + packuswb m6, m7 + packuswb m0, m1 + vpermq m8, m8, q3120 + vpermq m10, m10, q3120 + vpermq m6, m6, q3120 + vpermq m0, m0, q3120 + vextracti128 xm9, m8, 1 + vextracti128 xm11, m10, 1 + vextracti128 xm7, m6, 1 + vextracti128 xm1, m0, 1 + movu [r2], xm8 + movu [r2 + r3], xm9 + movu [r2 + r3 * 2], xm10 + movu [r2 + r5], xm11 + lea r2, [r2 + r3 * 4] + movu [r2], xm6 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm0 + movu [r2 + r5], xm1 +%else + psubw m8, m14 ; m8 = word: row 8 + psubw m9, m14 ; m9 = word: row 9 + psubw m10, m14 ; m10 = word: row 10 + psubw m11, m14 ; m11 = word: row 11 + psubw m6, m14 ; m6 = word: row 12 + psubw m7, m14 ; m7 = word: row 13 + psubw m0, m14 ; m0 = word: row 14 + psubw m1, m14 ; m1 = word: row 15 + movu [r2], m8 + movu [r2 + r3], m9 + movu [r2 + r3 * 2], m10 + movu [r2 + r5], m11 + lea r2, [r2 + r3 * 4] + movu [r2], m6 + movu [r2 + r3], m7 + movu [r2 + r3 * 2], m0 + movu [r2 + r5], m1 +%endif + lea r2, [r2 + r3 * 4] + + movu xm5, [r0 + r4] ; m5 = row 19 + punpckhbw xm6, xm4, xm5 + punpcklbw xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddubsw m6, m4, m13 + paddw m2, m6 + pmaddubsw m4, m12 + lea r0, [r0 + r1 * 4] + movu xm6, [r0] ; m6 = row 20 + punpckhbw xm7, xm5, xm6 + punpcklbw xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddubsw m7, m5, m13 + paddw m3, m7 + pmaddubsw m5, m12 + movu xm7, [r0 + r1] ; m7 = row 21 + punpckhbw xm0, xm6, xm7 + punpcklbw xm6, xm7 + vinserti128 m6, m6, xm0, 1 + pmaddubsw m0, m6, m13 + paddw m4, m0 + pmaddubsw m6, m12 + movu xm0, [r0 + r1 * 2] ; m0 = row 22 + punpckhbw xm1, xm7, xm0 + punpcklbw xm7, xm0 + vinserti128 m7, m7, xm1, 1 + pmaddubsw m1, m7, m13 + paddw m5, m1 + pmaddubsw m7, m12 + movu xm1, [r0 + r4] ; m1 = row 23 + punpckhbw xm8, xm0, xm1 + punpcklbw xm0, xm1 + vinserti128 m0, m0, xm8, 1 + pmaddubsw m8, m0, m13 + paddw m6, m8 + pmaddubsw m0, m12 + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 24 + punpckhbw xm9, xm1, xm8 + punpcklbw xm1, xm8 + vinserti128 m1, m1, xm9, 1 + pmaddubsw m9, m1, m13 + paddw m7, m9 + pmaddubsw m1, m12 + movu xm9, [r0 + r1] ; m9 = row 25 + punpckhbw xm10, xm8, xm9 + punpcklbw xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddubsw m8, m13 + paddw m0, m8 + movu xm10, [r0 + r1 * 2] ; m10 = row 26 + punpckhbw xm11, xm9, xm10 + punpcklbw xm9, xm10 + vinserti128 m9, m9, xm11, 1 + pmaddubsw m9, m13 + paddw m1, m9 + +%ifidn %1,pp + pmulhrsw m2, m14 ; m2 = word: row 16 + pmulhrsw m3, m14 ; m3 = word: row 17 + pmulhrsw m4, m14 ; m4 = word: row 18 + pmulhrsw m5, m14 ; 
m5 = word: row 19 + pmulhrsw m6, m14 ; m6 = word: row 20 + pmulhrsw m7, m14 ; m7 = word: row 21 + pmulhrsw m0, m14 ; m0 = word: row 22 + pmulhrsw m1, m14 ; m1 = word: row 23 + packuswb m2, m3 + packuswb m4, m5 + packuswb m6, m7 + packuswb m0, m1 + vpermq m2, m2, q3120 + vpermq m4, m4, q3120 + vpermq m6, m6, q3120 + vpermq m0, m0, q3120 + vextracti128 xm3, m2, 1 + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + vextracti128 xm1, m0, 1 + movu [r2], xm2 + movu [r2 + r3], xm3 + movu [r2 + r3 * 2], xm4 + movu [r2 + r5], xm5 + lea r2, [r2 + r3 * 4] + movu [r2], xm6 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm0 + movu [r2 + r5], xm1 +%else + psubw m2, m14 ; m2 = word: row 16 + psubw m3, m14 ; m3 = word: row 17 + psubw m4, m14 ; m4 = word: row 18 + psubw m5, m14 ; m5 = word: row 19 + psubw m6, m14 ; m6 = word: row 20 + psubw m7, m14 ; m7 = word: row 21 + psubw m0, m14 ; m0 = word: row 22 + psubw m1, m14 ; m1 = word: row 23 + movu [r2], m2 + movu [r2 + r3], m3 + movu [r2 + r3 * 2], m4 + movu [r2 + r5], m5 + lea r2, [r2 + r3 * 4] + movu [r2], m6 + movu [r2 + r3], m7 + movu [r2 + r3 * 2], m0 + movu [r2 + r5], m1 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_AVX2_16x24 pp + FILTER_VER_CHROMA_AVX2_16x24 ps %macro FILTER_VER_CHROMA_AVX2_24x32 1 INIT_YMM avx2 @@ -6863,8 +9113,8 @@ %endif %endmacro -FILTER_VER_CHROMA_AVX2_24x32 pp -FILTER_VER_CHROMA_AVX2_24x32 ps + FILTER_VER_CHROMA_AVX2_24x32 pp + FILTER_VER_CHROMA_AVX2_24x32 ps %macro FILTER_VER_CHROMA_AVX2_16x4 1 INIT_YMM avx2 @@ -6961,12 +9211,12 @@ RET %endmacro -FILTER_VER_CHROMA_AVX2_16x4 pp -FILTER_VER_CHROMA_AVX2_16x4 ps + FILTER_VER_CHROMA_AVX2_16x4 pp + FILTER_VER_CHROMA_AVX2_16x4 ps -%macro FILTER_VER_CHROMA_AVX2_12x16 1 +%macro FILTER_VER_CHROMA_AVX2_12xN 2 INIT_YMM avx2 -cglobal interp_4tap_vert_%1_12x16, 4, 7, 8 +cglobal interp_4tap_vert_%1_12x%2, 4, 7, 8 mov r4d, r4m shl r4d, 6 @@ -6986,7 +9236,7 @@ vbroadcasti128 m7, [pw_2000] %endif lea r6, [r3 * 3] - +%rep %2 / 16 movu xm0, [r0] ; m0 = row 0 movu xm1, [r0 + r1] ; m1 = row 1 punpckhbw xm2, xm0, xm1 @@ -7272,11 +9522,15 @@ vextracti128 xm5, m5, 1 movq [r2 + r6 + 16], xm5 %endif + lea r2, [r2 + r3 * 4] +%endrep RET %endmacro -FILTER_VER_CHROMA_AVX2_12x16 pp -FILTER_VER_CHROMA_AVX2_12x16 ps + FILTER_VER_CHROMA_AVX2_12xN pp, 16 + FILTER_VER_CHROMA_AVX2_12xN ps, 16 + FILTER_VER_CHROMA_AVX2_12xN pp, 32 + FILTER_VER_CHROMA_AVX2_12xN ps, 32 ;----------------------------------------------------------------------------- ;void interp_4tap_vert_pp_24x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -7285,121 +9539,121 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_24x%2, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m1, m0, [tab_Vm] -pshufb m0, [tab_Vm + 16] + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] -mov r4d, %2 + mov r4d, %2 .loop: -movu m2, [r0] -movu m3, [r0 + r1] + movu m2, [r0] + movu m3, [r0 + r1] -punpcklbw m4, m2, m3 -punpckhbw m2, m3 + punpcklbw m4, m2, m3 + punpckhbw m2, m3 -pmaddubsw m4, m1 -pmaddubsw m2, m1 + pmaddubsw m4, m1 + pmaddubsw m2, m1 -lea r5, [r0 + 2 * r1] -movu m5, [r5] -movu m7, [r5 + r1] + lea r5, [r0 + 2 * r1] + movu m5, [r5] + movu m7, [r5 + r1] -punpcklbw m6, m5, m7 -pmaddubsw m6, m0 -paddw m4, m6 + punpcklbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m4, m6 -punpckhbw m6, m5, m7 -pmaddubsw m6, 
m0 -paddw m2, m6 + punpckhbw m6, m5, m7 + pmaddubsw m6, m0 + paddw m2, m6 -mova m6, [pw_512] + mova m6, [pw_512] -pmulhrsw m4, m6 -pmulhrsw m2, m6 + pmulhrsw m4, m6 + pmulhrsw m2, m6 -packuswb m4, m2 + packuswb m4, m2 -movu [r2], m4 + movu [r2], m4 -punpcklbw m4, m3, m5 -punpckhbw m3, m5 + punpcklbw m4, m3, m5 + punpckhbw m3, m5 -pmaddubsw m4, m1 -pmaddubsw m3, m1 + pmaddubsw m4, m1 + pmaddubsw m3, m1 -movu m2, [r5 + 2 * r1] + movu m2, [r5 + 2 * r1] -punpcklbw m5, m7, m2 -punpckhbw m7, m2 + punpcklbw m5, m7, m2 + punpckhbw m7, m2 -pmaddubsw m5, m0 -pmaddubsw m7, m0 + pmaddubsw m5, m0 + pmaddubsw m7, m0 -paddw m4, m5 -paddw m3, m7 + paddw m4, m5 + paddw m3, m7 -pmulhrsw m4, m6 -pmulhrsw m3, m6 + pmulhrsw m4, m6 + pmulhrsw m3, m6 -packuswb m4, m3 + packuswb m4, m3 -movu [r2 + r3], m4 + movu [r2 + r3], m4 -movq m2, [r0 + 16] -movq m3, [r0 + r1 + 16] -movq m4, [r5 + 16] -movq m5, [r5 + r1 + 16] + movq m2, [r0 + 16] + movq m3, [r0 + r1 + 16] + movq m4, [r5 + 16] + movq m5, [r5 + r1 + 16] -punpcklbw m2, m3 -punpcklbw m4, m5 + punpcklbw m2, m3 + punpcklbw m4, m5 -pmaddubsw m2, m1 -pmaddubsw m4, m0 + pmaddubsw m2, m1 + pmaddubsw m4, m0 -paddw m2, m4 + paddw m2, m4 -pmulhrsw m2, m6 + pmulhrsw m2, m6 -movq m3, [r0 + r1 + 16] -movq m4, [r5 + 16] -movq m5, [r5 + r1 + 16] -movq m7, [r5 + 2 * r1 + 16] + movq m3, [r0 + r1 + 16] + movq m4, [r5 + 16] + movq m5, [r5 + r1 + 16] + movq m7, [r5 + 2 * r1 + 16] -punpcklbw m3, m4 -punpcklbw m5, m7 + punpcklbw m3, m4 + punpcklbw m5, m7 -pmaddubsw m3, m1 -pmaddubsw m5, m0 + pmaddubsw m3, m1 + pmaddubsw m5, m0 -paddw m3, m5 + paddw m3, m5 -pmulhrsw m3, m6 -packuswb m2, m3 + pmulhrsw m3, m6 + packuswb m2, m3 -movh [r2 + 16], m2 -movhps [r2 + r3 + 16], m2 + movh [r2 + 16], m2 + movhps [r2 + r3 + 16], m2 -mov r0, r5 -lea r2, [r2 + 2 * r3] + mov r0, r5 + lea r2, [r2 + 2 * r3] -sub r4, 2 -jnz .loop -RET + sub r4, 2 + jnz .loop + RET %endmacro -FILTER_V4_W24 24, 32 + FILTER_V4_W24 24, 32 -FILTER_V4_W24 24, 64 + FILTER_V4_W24 24, 64 ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -7408,100 +9662,100 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m1, m0, [tab_Vm] -pshufb m0, [tab_Vm + 16] + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] -mova m7, [pw_512] + mova m7, [pw_512] -mov r4d, %2 + mov r4d, %2 .loop: -movu m2, [r0] -movu m3, [r0 + r1] + movu m2, [r0] + movu m3, [r0 + r1] -punpcklbw m4, m2, m3 -punpckhbw m2, m3 + punpcklbw m4, m2, m3 + punpckhbw m2, m3 -pmaddubsw m4, m1 -pmaddubsw m2, m1 + pmaddubsw m4, m1 + pmaddubsw m2, m1 -lea r5, [r0 + 2 * r1] -movu m3, [r5] -movu m5, [r5 + r1] + lea r5, [r0 + 2 * r1] + movu m3, [r5] + movu m5, [r5 + r1] -punpcklbw m6, m3, m5 -punpckhbw m3, m5 + punpcklbw m6, m3, m5 + punpckhbw m3, m5 -pmaddubsw m6, m0 -pmaddubsw m3, m0 + pmaddubsw m6, m0 + pmaddubsw m3, m0 -paddw m4, m6 -paddw m2, m3 + paddw m4, m6 + paddw m2, m3 -pmulhrsw m4, m7 -pmulhrsw m2, m7 + pmulhrsw m4, m7 + pmulhrsw m2, m7 -packuswb m4, m2 + packuswb m4, m2 -movu [r2], m4 + movu [r2], m4 -movu m2, [r0 + 16] -movu m3, [r0 + r1 + 16] + movu m2, [r0 + 16] + movu m3, [r0 + r1 + 16] -punpcklbw m4, m2, m3 -punpckhbw m2, m3 + punpcklbw m4, m2, m3 + punpckhbw 
m2, m3 -pmaddubsw m4, m1 -pmaddubsw m2, m1 + pmaddubsw m4, m1 + pmaddubsw m2, m1 -movu m3, [r5 + 16] -movu m5, [r5 + r1 + 16] + movu m3, [r5 + 16] + movu m5, [r5 + r1 + 16] -punpcklbw m6, m3, m5 -punpckhbw m3, m5 + punpcklbw m6, m3, m5 + punpckhbw m3, m5 -pmaddubsw m6, m0 -pmaddubsw m3, m0 + pmaddubsw m6, m0 + pmaddubsw m3, m0 -paddw m4, m6 -paddw m2, m3 + paddw m4, m6 + paddw m2, m3 -pmulhrsw m4, m7 -pmulhrsw m2, m7 + pmulhrsw m4, m7 + pmulhrsw m2, m7 -packuswb m4, m2 + packuswb m4, m2 -movu [r2 + 16], m4 + movu [r2 + 16], m4 -lea r0, [r0 + r1] -lea r2, [r2 + r3] + lea r0, [r0 + r1] + lea r2, [r2 + r3] -dec r4 -jnz .loop -RET + dec r4 + jnz .loop + RET %endmacro -FILTER_V4_W32 32, 8 -FILTER_V4_W32 32, 16 -FILTER_V4_W32 32, 24 -FILTER_V4_W32 32, 32 + FILTER_V4_W32 32, 8 + FILTER_V4_W32 32, 16 + FILTER_V4_W32 32, 24 + FILTER_V4_W32 32, 32 -FILTER_V4_W32 32, 48 -FILTER_V4_W32 32, 64 + FILTER_V4_W32 32, 48 + FILTER_V4_W32 32, 64 %macro FILTER_VER_CHROMA_AVX2_32xN 2 -INIT_YMM avx2 %if ARCH_X86_64 == 1 +INIT_YMM avx2 cglobal interp_4tap_vert_%1_32x%2, 4, 7, 13 mov r4d, r4m shl r4d, 6 @@ -7631,14 +9885,18 @@ %endif %endmacro -FILTER_VER_CHROMA_AVX2_32xN pp, 32 -FILTER_VER_CHROMA_AVX2_32xN pp, 24 -FILTER_VER_CHROMA_AVX2_32xN pp, 16 -FILTER_VER_CHROMA_AVX2_32xN pp, 8 -FILTER_VER_CHROMA_AVX2_32xN ps, 32 -FILTER_VER_CHROMA_AVX2_32xN ps, 24 -FILTER_VER_CHROMA_AVX2_32xN ps, 16 -FILTER_VER_CHROMA_AVX2_32xN ps, 8 + FILTER_VER_CHROMA_AVX2_32xN pp, 64 + FILTER_VER_CHROMA_AVX2_32xN pp, 48 + FILTER_VER_CHROMA_AVX2_32xN pp, 32 + FILTER_VER_CHROMA_AVX2_32xN pp, 24 + FILTER_VER_CHROMA_AVX2_32xN pp, 16 + FILTER_VER_CHROMA_AVX2_32xN pp, 8 + FILTER_VER_CHROMA_AVX2_32xN ps, 64 + FILTER_VER_CHROMA_AVX2_32xN ps, 48 + FILTER_VER_CHROMA_AVX2_32xN ps, 32 + FILTER_VER_CHROMA_AVX2_32xN ps, 24 + FILTER_VER_CHROMA_AVX2_32xN ps, 16 + FILTER_VER_CHROMA_AVX2_32xN ps, 8 ;----------------------------------------------------------------------------- ; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -7647,413 +9905,1338 @@ INIT_XMM sse4 cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8 -mov r4d, r4m -sub r0, r1 + mov r4d, r4m + sub r0, r1 %ifdef PIC -lea r5, [tab_ChromaCoeff] -movd m0, [r5 + r4 * 4] + lea r5, [tab_ChromaCoeff] + movd m0, [r5 + r4 * 4] %else -movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [tab_ChromaCoeff + r4 * 4] %endif -pshufb m1, m0, [tab_Vm] -pshufb m0, [tab_Vm + 16] + pshufb m1, m0, [tab_Vm] + pshufb m0, [tab_Vm + 16] -mov r4d, %2/2 + mov r4d, %2/2 .loop: -mov r6d, %1/16 + mov r6d, %1/16 .loopW: -movu m2, [r0] -movu m3, [r0 + r1] + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + pmaddubsw m4, m1 + pmaddubsw m2, m1 + + lea r5, [r0 + 2 * r1] + movu m5, [r5] + movu m6, [r5 + r1] -punpcklbw m4, m2, m3 -punpckhbw m2, m3 + punpckhbw m7, m5, m6 + pmaddubsw m7, m0 + paddw m2, m7 -pmaddubsw m4, m1 -pmaddubsw m2, m1 + punpcklbw m7, m5, m6 + pmaddubsw m7, m0 + paddw m4, m7 -lea r5, [r0 + 2 * r1] -movu m5, [r5] -movu m6, [r5 + r1] + mova m7, [pw_512] -punpckhbw m7, m5, m6 -pmaddubsw m7, m0 -paddw m2, m7 + pmulhrsw m4, m7 + pmulhrsw m2, m7 -punpcklbw m7, m5, m6 -pmaddubsw m7, m0 -paddw m4, m7 + packuswb m4, m2 -mova m7, [pw_512] + movu [r2], m4 -pmulhrsw m4, m7 -pmulhrsw m2, m7 + punpcklbw m4, m3, m5 + punpckhbw m3, m5 -packuswb m4, m2 + pmaddubsw m4, m1 + pmaddubsw m3, m1 -movu [r2], m4 + movu m5, [r5 + 2 * r1] -punpcklbw m4, m3, m5 -punpckhbw m3, m5 + punpcklbw m2, m6, m5 + punpckhbw m6, m5 -pmaddubsw m4, m1 -pmaddubsw m3, m1 + 
pmaddubsw m2, m0 + pmaddubsw m6, m0 -movu m5, [r5 + 2 * r1] + paddw m4, m2 + paddw m3, m6 -punpcklbw m2, m6, m5 -punpckhbw m6, m5 + pmulhrsw m4, m7 + pmulhrsw m3, m7 -pmaddubsw m2, m0 -pmaddubsw m6, m0 + packuswb m4, m3 -paddw m4, m2 -paddw m3, m6 + movu [r2 + r3], m4 -pmulhrsw m4, m7 -pmulhrsw m3, m7 + add r0, 16 + add r2, 16 + dec r6d + jnz .loopW -packuswb m4, m3 + lea r0, [r0 + r1 * 2 - %1] + lea r2, [r2 + r3 * 2 - %1] -movu [r2 + r3], m4 + dec r4d + jnz .loop + RET +%endmacro -add r0, 16 -add r2, 16 -dec r6d -jnz .loopW + FILTER_V4_W16n_H2 64, 64 + FILTER_V4_W16n_H2 64, 32 + FILTER_V4_W16n_H2 64, 48 + FILTER_V4_W16n_H2 48, 64 + FILTER_V4_W16n_H2 64, 16 -lea r0, [r0 + r1 * 2 - %1] -lea r2, [r2 + r3 * 2 - %1] +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_2xN 1 +INIT_XMM sse4 +cglobal filterPixelToShort_2x%1, 3, 4, 3 + mov r3d, r3m + add r3d, r3d -dec r4d -jnz .loop -RET + ; load constant + mova m1, [pb_128] + mova m2, [tab_c_64_n64] + +%rep %1/2 + movd m0, [r0] + pinsrd m0, [r0 + r1], 1 + punpcklbw m0, m1 + pmaddubsw m0, m2 + + movd [r2 + r3 * 0], m0 + pextrd [r2 + r3 * 1], m0, 2 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] +%endrep + RET %endmacro + P2S_H_2xN 4 + P2S_H_2xN 8 + P2S_H_2xN 16 -FILTER_V4_W16n_H2 64, 64 -FILTER_V4_W16n_H2 64, 32 -FILTER_V4_W16n_H2 64, 48 -FILTER_V4_W16n_H2 48, 64 -FILTER_V4_W16n_H2 64, 16 ;----------------------------------------------------------------------------- -; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height) +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) ;----------------------------------------------------------------------------- -%macro PIXEL_WH_4xN 2 -INIT_XMM ssse3 -cglobal pixelToShort_%1x%2, 3, 7, 6 +%macro P2S_H_4xN 1 +INIT_XMM sse4 +cglobal filterPixelToShort_4x%1, 3, 6, 4 + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + mova m2, [pb_128] + mova m3, [tab_c_64_n64] + +%assign x 0 +%rep %1/4 + movd m0, [r0] + pinsrd m0, [r0 + r1], 1 + punpcklbw m0, m2 + pmaddubsw m0, m3 + + movd m1, [r0 + r1 * 2] + pinsrd m1, [r0 + r5], 1 + punpcklbw m1, m2 + pmaddubsw m1, m3 + + movq [r2 + r3 * 0], m0 + movq [r2 + r3 * 2], m1 + movhps [r2 + r3 * 1], m0 + movhps [r2 + r4], m1 +%assign x x+1 +%if (x != %1/4) + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endif +%endrep + RET +%endmacro + P2S_H_4xN 4 + P2S_H_4xN 8 + P2S_H_4xN 16 + P2S_H_4xN 32 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_6xN 1 +INIT_XMM sse4 +cglobal filterPixelToShort_6x%1, 3, 7, 6 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + + ; load height + mov r6d, %1/4 - ; load width and height - mov r3d, %1 - mov r4d, %2 ; load constant mova m4, [pb_128] mova m5, [tab_c_64_n64] -.loopH: - xor r5d, r5d -.loopW: - mov r6, r0 - movh m0, [r6] +.loop: + movh m0, [r0] punpcklbw m0, m4 pmaddubsw m0, m5 - movh m1, [r6 + r1] + movh m1, [r0 + r1] punpcklbw m1, m4 pmaddubsw m1, m5 - movh m2, [r6 + r1 * 2] + movh m2, [r0 + r1 * 2] punpcklbw m2, m4 pmaddubsw m2, m5 - lea r6, [r6 + r1 * 2] - movh m3, [r6 + r1] + movh m3, [r0 + 
r4] punpcklbw m3, m4 pmaddubsw m3, m5 - add r5, 8 - cmp r5, r3 - jg .width4 - movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0 - movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1 - movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2 - movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3 - je .nextH - jmp .loopW - -.width4: - movh [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0 - movh [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1 - movh [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2 - movh [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3 + movh [r2 + r3 * 0], m0 + pextrd [r2 + r3 * 0 + 8], m0, 2 + movh [r2 + r3 * 1], m1 + pextrd [r2 + r3 * 1 + 8], m1, 2 + movh [r2 + r3 * 2], m2 + pextrd [r2 + r3 * 2 + 8], m2, 2 + movh [r2 + r5], m3 + pextrd [r2 + r5 + 8], m3, 2 -.nextH: lea r0, [r0 + r1 * 4] - add r2, FENC_STRIDE * 8 + lea r2, [r2 + r3 * 4] - sub r4d, 4 - jnz .loopH + dec r6d + jnz .loop RET %endmacro -PIXEL_WH_4xN 4, 4 -PIXEL_WH_4xN 4, 8 -PIXEL_WH_4xN 4, 16 + P2S_H_6xN 8 + P2S_H_6xN 16 ;----------------------------------------------------------------------------- -; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height) +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) ;----------------------------------------------------------------------------- -%macro PIXEL_WH_8xN 2 +%macro P2S_H_8xN 1 INIT_XMM ssse3 -cglobal pixelToShort_%1x%2, 3, 7, 6 +cglobal filterPixelToShort_8x%1, 3, 7, 6 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] - ; load width and height - mov r3d, %1 - mov r4d, %2 + ; load height + mov r4d, %1/4 ; load constant mova m4, [pb_128] mova m5, [tab_c_64_n64] -.loopH - xor r5d, r5d -.loopW - lea r6, [r0 + r5] - - movh m0, [r6] +.loop + movh m0, [r0] punpcklbw m0, m4 pmaddubsw m0, m5 - movh m1, [r6 + r1] + movh m1, [r0 + r1] punpcklbw m1, m4 pmaddubsw m1, m5 - movh m2, [r6 + r1 * 2] + movh m2, [r0 + r1 * 2] punpcklbw m2, m4 pmaddubsw m2, m5 - lea r6, [r6 + r1 * 2] - movh m3, [r6 + r1] + movh m3, [r0 + r5] punpcklbw m3, m4 pmaddubsw m3, m5 - add r5, 8 - cmp r5, r3 - - movu [r2 + FENC_STRIDE * 0], m0 - movu [r2 + FENC_STRIDE * 2], m1 - movu [r2 + FENC_STRIDE * 4], m2 - movu [r2 + FENC_STRIDE * 6], m3 - - je .nextH - jmp .loopW - + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r6 ], m3 -.nextH: lea r0, [r0 + r1 * 4] - add r2, FENC_STRIDE * 8 + lea r2, [r2 + r3 * 4] - sub r4d, 4 - jnz .loopH + dec r4d + jnz .loop RET %endmacro -PIXEL_WH_8xN 8, 8 -PIXEL_WH_8xN 8, 4 -PIXEL_WH_8xN 8, 16 -PIXEL_WH_8xN 8, 32 + P2S_H_8xN 8 + P2S_H_8xN 4 + P2S_H_8xN 16 + P2S_H_8xN 32 + P2S_H_8xN 12 + P2S_H_8xN 64 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +INIT_XMM ssse3 +cglobal filterPixelToShort_8x6, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r1 * 5] + lea r6, [r3 * 3] + + ; load constant + mova m3, [pb_128] + mova m4, [tab_c_64_n64] + + movh m0, [r0] + punpcklbw m0, m3 + pmaddubsw m0, m4 + + movh m1, [r0 + r1] + punpcklbw m1, m3 + pmaddubsw m1, m4 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m3 + pmaddubsw m2, m4 + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + + movh m0, [r0 + r4] + punpcklbw m0, m3 + pmaddubsw m0, m4 + + movh m1, [r0 + r1 * 4] + punpcklbw m1, m3 + pmaddubsw m1, m4 + + movh m2, [r0 + r5] + punpcklbw m2, m3 + pmaddubsw m2, m4 + + movu [r2 + r6 
], m0 + movu [r2 + r3 * 4], m1 + lea r2, [r2 + r3 * 4] + movu [r2 + r3], m2 + + RET ;----------------------------------------------------------------------------- -; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height) +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) ;----------------------------------------------------------------------------- -%macro PIXEL_WH_16xN 2 +%macro P2S_H_16xN 1 INIT_XMM ssse3 -cglobal pixelToShort_%1x%2, 3, 7, 6 +cglobal filterPixelToShort_16x%1, 3, 7, 6 + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] - ; load width and height - mov r3d, %1 - mov r4d, %2 + ; load height + mov r6d, %1/4 ; load constant mova m4, [pb_128] mova m5, [tab_c_64_n64] -.loopH: - xor r5d, r5d -.loopW: - lea r6, [r0 + r5] - - movh m0, [r6] +.loop: + movh m0, [r0] punpcklbw m0, m4 pmaddubsw m0, m5 - movh m1, [r6 + r1] + movh m1, [r0 + r1] punpcklbw m1, m4 pmaddubsw m1, m5 - movh m2, [r6 + r1 * 2] + movh m2, [r0 + r1 * 2] punpcklbw m2, m4 pmaddubsw m2, m5 - lea r6, [r6 + r1 * 2] - movh m3, [r6 + r1] + movh m3, [r0 + r5] punpcklbw m3, m4 pmaddubsw m3, m5 - add r5, 8 - cmp r5, r3 + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 - movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0 - movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1 - movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2 - movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3 - je .nextH - jmp .loopW + lea r0, [r0 + 8] + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 -.nextH: - lea r0, [r0 + r1 * 4] - add r2, FENC_STRIDE * 8 + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 - sub r4d, 4 - jnz .loopH + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 1 + 16], m1 + movu [r2 + r3 * 2 + 16], m2 + movu [r2 + r4 + 16], m3 + lea r0, [r0 + r1 * 4 - 8] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop RET %endmacro -PIXEL_WH_16xN 16, 16 -PIXEL_WH_16xN 16, 8 -PIXEL_WH_16xN 16, 4 -PIXEL_WH_16xN 16, 12 -PIXEL_WH_16xN 16, 32 -PIXEL_WH_16xN 16, 64 + P2S_H_16xN 16 + P2S_H_16xN 4 + P2S_H_16xN 8 + P2S_H_16xN 12 + P2S_H_16xN 32 + P2S_H_16xN 64 + P2S_H_16xN 24 ;----------------------------------------------------------------------------- -; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height) +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) ;----------------------------------------------------------------------------- -%macro PIXEL_WH_32xN 2 +%macro P2S_H_32xN 1 INIT_XMM ssse3 -cglobal pixelToShort_%1x%2, 3, 7, 6 +cglobal filterPixelToShort_32x%1, 3, 7, 6 + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] - ; load width and height - mov r3d, %1 - mov r4d, %2 + ; load height + mov r6d, %1/4 ; load constant mova m4, [pb_128] mova m5, [tab_c_64_n64] -.loopH: - xor r5d, r5d -.loopW: - lea r6, [r0 + r5] +.loop: + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + + lea r0, [r0 + 8] + + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 + + movh m2, [r0 + r1 * 2] + 
punpcklbw m2, m4 + pmaddubsw m2, m5 - movh m0, [r6] + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 1 + 16], m1 + movu [r2 + r3 * 2 + 16], m2 + movu [r2 + r4 + 16], m3 + + lea r0, [r0 + 8] + + movh m0, [r0] punpcklbw m0, m4 pmaddubsw m0, m5 - movh m1, [r6 + r1] + movh m1, [r0 + r1] punpcklbw m1, m4 pmaddubsw m1, m5 - movh m2, [r6 + r1 * 2] + movh m2, [r0 + r1 * 2] punpcklbw m2, m4 pmaddubsw m2, m5 - lea r6, [r6 + r1 * 2] - movh m3, [r6 + r1] + movh m3, [r0 + r5] punpcklbw m3, m4 pmaddubsw m3, m5 - add r5, 8 - cmp r5, r3 + movu [r2 + r3 * 0 + 32], m0 + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 2 + 32], m2 + movu [r2 + r4 + 32], m3 - movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0 - movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1 - movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2 - movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3 - je .nextH - jmp .loopW + lea r0, [r0 + 8] + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 -.nextH: - lea r0, [r0 + r1 * 4] - add r2, FENC_STRIDE * 8 + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 - sub r4d, 4 - jnz .loopH + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0 + 48], m0 + movu [r2 + r3 * 1 + 48], m1 + movu [r2 + r3 * 2 + 48], m2 + movu [r2 + r4 + 48], m3 + + lea r0, [r0 + r1 * 4 - 24] + lea r2, [r2 + r3 * 4] + dec r6d + jnz .loop RET %endmacro -PIXEL_WH_32xN 32, 32 -PIXEL_WH_32xN 32, 8 -PIXEL_WH_32xN 32, 16 -PIXEL_WH_32xN 32, 24 -PIXEL_WH_32xN 32, 64 + P2S_H_32xN 32 + P2S_H_32xN 8 + P2S_H_32xN 16 + P2S_H_32xN 24 + P2S_H_32xN 64 + P2S_H_32xN 48 ;----------------------------------------------------------------------------- -; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height) +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) ;----------------------------------------------------------------------------- -%macro PIXEL_WH_64xN 2 +%macro P2S_H_32xN_avx2 1 +INIT_YMM avx2 +cglobal filterPixelToShort_32x%1, 3, 7, 3 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load height + mov r4d, %1/4 + + ; load constant + vpbroadcastd m2, [pw_2000] + +.loop: + pmovzxbw m0, [r0 + 0 * mmsize/2] + pmovzxbw m1, [r0 + 1 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psubw m0, m2 + psubw m1, m2 + movu [r2 + 0 * mmsize], m0 + movu [r2 + 1 * mmsize], m1 + + pmovzxbw m0, [r0 + r1 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r1 + 1 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psubw m0, m2 + psubw m1, m2 + movu [r2 + r3 + 0 * mmsize], m0 + movu [r2 + r3 + 1 * mmsize], m1 + + pmovzxbw m0, [r0 + r1 * 2 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r1 * 2 + 1 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psubw m0, m2 + psubw m1, m2 + movu [r2 + r3 * 2 + 0 * mmsize], m0 + movu [r2 + r3 * 2 + 1 * mmsize], m1 + + pmovzxbw m0, [r0 + r5 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r5 + 1 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psubw m0, m2 + psubw m1, m2 + movu [r2 + r6 + 0 * mmsize], m0 + movu [r2 + r6 + 1 * mmsize], m1 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r4d + jnz .loop + RET +%endmacro + P2S_H_32xN_avx2 32 + P2S_H_32xN_avx2 8 + P2S_H_32xN_avx2 16 + P2S_H_32xN_avx2 24 + P2S_H_32xN_avx2 64 + P2S_H_32xN_avx2 48 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) 
+;----------------------------------------------------------------------------- +%macro P2S_H_64xN 1 INIT_XMM ssse3 -cglobal pixelToShort_%1x%2, 3, 7, 6 +cglobal filterPixelToShort_64x%1, 3, 7, 6 + mov r3d, r3m + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] - ; load width and height - mov r3d, %1 - mov r4d, %2 + ; load height + mov r6d, %1/4 ; load constant mova m4, [pb_128] mova m5, [tab_c_64_n64] -.loopH: - xor r5d, r5d -.loopW: - lea r6, [r0 + r5] +.loop: + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 - movh m0, [r6] + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0], m0 + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + + lea r0, [r0 + 8] + + movh m0, [r0] punpcklbw m0, m4 pmaddubsw m0, m5 - movh m1, [r6 + r1] + movh m1, [r0 + r1] punpcklbw m1, m4 pmaddubsw m1, m5 - movh m2, [r6 + r1 * 2] + movh m2, [r0 + r1 * 2] punpcklbw m2, m4 pmaddubsw m2, m5 - lea r6, [r6 + r1 * 2] - movh m3, [r6 + r1] + movh m3, [r0 + r5] punpcklbw m3, m4 pmaddubsw m3, m5 - add r5, 8 - cmp r5, r3 + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 1 + 16], m1 + movu [r2 + r3 * 2 + 16], m2 + movu [r2 + r4 + 16], m3 + + lea r0, [r0 + 8] + + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 - movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0 - movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1 - movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2 - movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3 - je .nextH - jmp .loopW + movu [r2 + r3 * 0 + 32], m0 + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 2 + 32], m2 + movu [r2 + r4 + 32], m3 + lea r0, [r0 + 8] + + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0 + 48], m0 + movu [r2 + r3 * 1 + 48], m1 + movu [r2 + r3 * 2 + 48], m2 + movu [r2 + r4 + 48], m3 + + lea r0, [r0 + 8] + + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0 + 64], m0 + movu [r2 + r3 * 1 + 64], m1 + movu [r2 + r3 * 2 + 64], m2 + movu [r2 + r4 + 64], m3 + + lea r0, [r0 + 8] + + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0 + 80], m0 + movu [r2 + r3 * 1 + 80], m1 + movu [r2 + r3 * 2 + 80], m2 + movu [r2 + r4 + 80], m3 + + lea r0, [r0 + 8] + + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, m5 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0 + 96], m0 + movu [r2 + r3 * 1 + 96], m1 + movu [r2 + r3 * 2 + 96], m2 + movu [r2 + r4 + 96], m3 + + lea r0, [r0 + 8] + + movh m0, [r0] + punpcklbw m0, m4 + pmaddubsw m0, m5 + + movh m1, [r0 + r1] + punpcklbw m1, m4 + pmaddubsw m1, 
m5 + + movh m2, [r0 + r1 * 2] + punpcklbw m2, m4 + pmaddubsw m2, m5 + + movh m3, [r0 + r5] + punpcklbw m3, m4 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0 + 112], m0 + movu [r2 + r3 * 1 + 112], m1 + movu [r2 + r3 * 2 + 112], m2 + movu [r2 + r4 + 112], m3 + + lea r0, [r0 + r1 * 4 - 56] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro + P2S_H_64xN 64 + P2S_H_64xN 16 + P2S_H_64xN 32 + P2S_H_64xN 48 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_64xN_avx2 1 +INIT_YMM avx2 +cglobal filterPixelToShort_64x%1, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load height + mov r4d, %1/4 + + ; load constant + vpbroadcastd m4, [pw_2000] + +.loop: + pmovzxbw m0, [r0 + 0 * mmsize/2] + pmovzxbw m1, [r0 + 1 * mmsize/2] + pmovzxbw m2, [r0 + 2 * mmsize/2] + pmovzxbw m3, [r0 + 3 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + + movu [r2 + 0 * mmsize], m0 + movu [r2 + 1 * mmsize], m1 + movu [r2 + 2 * mmsize], m2 + movu [r2 + 3 * mmsize], m3 + + pmovzxbw m0, [r0 + r1 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r1 + 1 * mmsize/2] + pmovzxbw m2, [r0 + r1 + 2 * mmsize/2] + pmovzxbw m3, [r0 + r1 + 3 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + + movu [r2 + r3 + 0 * mmsize], m0 + movu [r2 + r3 + 1 * mmsize], m1 + movu [r2 + r3 + 2 * mmsize], m2 + movu [r2 + r3 + 3 * mmsize], m3 + + pmovzxbw m0, [r0 + r1 * 2 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r1 * 2 + 1 * mmsize/2] + pmovzxbw m2, [r0 + r1 * 2 + 2 * mmsize/2] + pmovzxbw m3, [r0 + r1 * 2 + 3 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + + movu [r2 + r3 * 2 + 0 * mmsize], m0 + movu [r2 + r3 * 2 + 1 * mmsize], m1 + movu [r2 + r3 * 2 + 2 * mmsize], m2 + movu [r2 + r3 * 2 + 3 * mmsize], m3 + + pmovzxbw m0, [r0 + r5 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r5 + 1 * mmsize/2] + pmovzxbw m2, [r0 + r5 + 2 * mmsize/2] + pmovzxbw m3, [r0 + r5 + 3 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + + movu [r2 + r6 + 0 * mmsize], m0 + movu [r2 + r6 + 1 * mmsize], m1 + movu [r2 + r6 + 2 * mmsize], m2 + movu [r2 + r6 + 3 * mmsize], m3 -.nextH: lea r0, [r0 + r1 * 4] - add r2, FENC_STRIDE * 8 + lea r2, [r2 + r3 * 4] - sub r4d, 4 - jnz .loopH + dec r4d + jnz .loop + RET +%endmacro + P2S_H_64xN_avx2 64 + P2S_H_64xN_avx2 16 + P2S_H_64xN_avx2 32 + P2S_H_64xN_avx2 48 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel src, intptr_t srcStride, int16_t dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_12xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_12x%1, 3, 7, 6 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r6, [r3 * 3] + mov r5d, %1/4 + ; load constant + mova m4, [pb_128] + mova m5, [tab_c_64_n64] + +.loop: + movu m0, [r0] + punpcklbw m1, m0, m4 + punpckhbw m0, m4 + pmaddubsw m0, m5 + pmaddubsw m1, m5 + + movu m2, [r0 + r1] + punpcklbw m3, m2, m4 + punpckhbw m2, m4 + pmaddubsw m2, m5 + pmaddubsw m3, m5 + + movu [r2 + r3 * 0], m1 + movu [r2 + r3 * 1], m3 + + movh [r2 + r3 * 
0 + 16], m0 + movh [r2 + r3 * 1 + 16], m2 + + movu m0, [r0 + r1 * 2] + punpcklbw m1, m0, m4 + punpckhbw m0, m4 + pmaddubsw m0, m5 + pmaddubsw m1, m5 + + movu m2, [r0 + r4] + punpcklbw m3, m2, m4 + punpckhbw m2, m4 + pmaddubsw m2, m5 + pmaddubsw m3, m5 + + movu [r2 + r3 * 2], m1 + movu [r2 + r6], m3 + + movh [r2 + r3 * 2 + 16], m0 + movh [r2 + r6 + 16], m2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r5d + jnz .loop + RET +%endmacro + P2S_H_12xN 16 + P2S_H_12xN 32 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_24xN 1 +INIT_XMM ssse3 +cglobal filterPixelToShort_24x%1, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + mov r6d, %1/4 + + ; load constant + mova m3, [pb_128] + mova m4, [tab_c_64_n64] + +.loop: + movu m0, [r0] + punpcklbw m1, m0, m3 + punpckhbw m0, m3 + pmaddubsw m0, m4 + pmaddubsw m1, m4 + + movu m2, [r0 + 16] + punpcklbw m2, m3 + pmaddubsw m2, m4 + + movu [r2 + r3 * 0], m1 + movu [r2 + r3 * 0 + 16], m0 + movu [r2 + r3 * 0 + 32], m2 + + movu m0, [r0 + r1] + punpcklbw m1, m0, m3 + punpckhbw m0, m3 + pmaddubsw m0, m4 + pmaddubsw m1, m4 + + movu m2, [r0 + r1 + 16] + punpcklbw m2, m3 + pmaddubsw m2, m4 + + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 1 + 16], m0 + movu [r2 + r3 * 1 + 32], m2 + + movu m0, [r0 + r1 * 2] + punpcklbw m1, m0, m3 + punpckhbw m0, m3 + pmaddubsw m0, m4 + pmaddubsw m1, m4 + + movu m2, [r0 + r1 * 2 + 16] + punpcklbw m2, m3 + pmaddubsw m2, m4 + + movu [r2 + r3 * 2], m1 + movu [r2 + r3 * 2 + 16], m0 + movu [r2 + r3 * 2 + 32], m2 + + movu m0, [r0 + r4] + punpcklbw m1, m0, m3 + punpckhbw m0, m3 + pmaddubsw m0, m4 + pmaddubsw m1, m4 + + movu m2, [r0 + r4 + 16] + punpcklbw m2, m3 + pmaddubsw m2, m4 + movu [r2 + r5], m1 + movu [r2 + r5 + 16], m0 + movu [r2 + r5 + 32], m2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop RET %endmacro -PIXEL_WH_64xN 64, 64 -PIXEL_WH_64xN 64, 16 -PIXEL_WH_64xN 64, 32 -PIXEL_WH_64xN 64, 48 + P2S_H_24xN 32 + P2S_H_24xN 64 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_24xN_avx2 1 +INIT_YMM avx2 +cglobal filterPixelToShort_24x%1, 3, 7, 4 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + mov r6d, %1/4 + + ; load constant + vpbroadcastd m1, [pw_2000] + vpbroadcastd m2, [pb_128] + vpbroadcastd m3, [tab_c_64_n64] + +.loop: + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + movu m0, [r0 + mmsize/2] + punpcklbw m0, m2 + pmaddubsw m0, m3 + movu [r2 + r3 * 0 + mmsize], xm0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + movu m0, [r0 + r1 + mmsize/2] + punpcklbw m0, m2 + pmaddubsw m0, m3 + movu [r2 + r3 * 1 + mmsize], xm0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + movu m0, [r0 + r1 * 2 + mmsize/2] + punpcklbw m0, m2 + pmaddubsw m0, m3 + movu [r2 + r3 * 2 + mmsize], xm0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + movu m0, [r0 + r4 + mmsize/2] + punpcklbw m0, m2 + pmaddubsw m0, m3 + movu [r2 + r5 + mmsize], xm0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro + 
P2S_H_24xN_avx2 32 + P2S_H_24xN_avx2 64 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +INIT_XMM ssse3 +cglobal filterPixelToShort_48x64, 3, 7, 4 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + mov r6d, 16 + + ; load constant + mova m2, [pb_128] + mova m3, [tab_c_64_n64] + +.loop: + movu m0, [r0] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 0], m1 + movu [r2 + r3 * 0 + 16], m0 + + movu m0, [r0 + 16] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 0 + 32], m1 + movu [r2 + r3 * 0 + 48], m0 + + movu m0, [r0 + 32] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 0 + 64], m1 + movu [r2 + r3 * 0 + 80], m0 + + movu m0, [r0 + r1] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 1], m1 + movu [r2 + r3 * 1 + 16], m0 + + movu m0, [r0 + r1 + 16] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 1 + 32], m1 + movu [r2 + r3 * 1 + 48], m0 + + movu m0, [r0 + r1 + 32] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 1 + 64], m1 + movu [r2 + r3 * 1 + 80], m0 + + movu m0, [r0 + r1 * 2] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 2], m1 + movu [r2 + r3 * 2 + 16], m0 + + movu m0, [r0 + r1 * 2 + 16] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 2 + 32], m1 + movu [r2 + r3 * 2 + 48], m0 + + movu m0, [r0 + r1 * 2 + 32] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r3 * 2 + 64], m1 + movu [r2 + r3 * 2 + 80], m0 + + movu m0, [r0 + r4] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r5], m1 + movu [r2 + r5 + 16], m0 + + movu m0, [r0 + r4 + 16] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r5 + 32], m1 + movu [r2 + r5 + 48], m0 + + movu m0, [r0 + r4 + 32] + punpcklbw m1, m0, m2 + punpckhbw m0, m2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + + movu [r2 + r5 + 64], m1 + movu [r2 + r5 + 80], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal filterPixelToShort_48x64, 3,7,4 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load height + mov r4d, 64/4 + + ; load constant + vpbroadcastd m3, [pw_2000] + + ; just unroll(1) because it is best choice for 48x64 +.loop: + pmovzxbw m0, [r0 + 0 * mmsize/2] + pmovzxbw m1, [r0 + 1 * mmsize/2] + pmovzxbw m2, [r0 + 2 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psubw m0, m3 + psubw m1, m3 + psubw m2, m3 + movu [r2 + 0 * mmsize], m0 + movu [r2 + 1 * mmsize], m1 + movu [r2 + 2 * mmsize], m2 + + pmovzxbw m0, [r0 + r1 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r1 + 1 * mmsize/2] + pmovzxbw m2, [r0 + r1 + 2 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psubw m0, m3 + psubw 
m1, m3 + psubw m2, m3 + movu [r2 + r3 + 0 * mmsize], m0 + movu [r2 + r3 + 1 * mmsize], m1 + movu [r2 + r3 + 2 * mmsize], m2 + + pmovzxbw m0, [r0 + r1 * 2 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r1 * 2 + 1 * mmsize/2] + pmovzxbw m2, [r0 + r1 * 2 + 2 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psubw m0, m3 + psubw m1, m3 + psubw m2, m3 + movu [r2 + r3 * 2 + 0 * mmsize], m0 + movu [r2 + r3 * 2 + 1 * mmsize], m1 + movu [r2 + r3 * 2 + 2 * mmsize], m2 + + pmovzxbw m0, [r0 + r5 + 0 * mmsize/2] + pmovzxbw m1, [r0 + r5 + 1 * mmsize/2] + pmovzxbw m2, [r0 + r5 + 2 * mmsize/2] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psubw m0, m3 + psubw m1, m3 + psubw m2, m3 + movu [r2 + r6 + 0 * mmsize], m0 + movu [r2 + r6 + 1 * mmsize], m1 + movu [r2 + r6 + 2 * mmsize], m2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r4d + jnz .loop + RET + %macro PROCESS_LUMA_W4_4R 0 movd m0, [r0] @@ -8495,36 +11678,36 @@ ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_4xN 4, 4, pp + FILTER_VER_LUMA_4xN 4, 4, pp ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_4xN 4, 8, pp -FILTER_VER_LUMA_AVX2_4xN 4, 8, pp + FILTER_VER_LUMA_4xN 4, 8, pp + FILTER_VER_LUMA_AVX2_4xN 4, 8, pp ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_4xN 4, 16, pp -FILTER_VER_LUMA_AVX2_4xN 4, 16, pp + FILTER_VER_LUMA_4xN 4, 16, pp + FILTER_VER_LUMA_AVX2_4xN 4, 16, pp ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ps_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_4xN 4, 4, ps + FILTER_VER_LUMA_4xN 4, 4, ps ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ps_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_4xN 4, 8, ps -FILTER_VER_LUMA_AVX2_4xN 4, 8, ps + FILTER_VER_LUMA_4xN 4, 8, ps + FILTER_VER_LUMA_AVX2_4xN 4, 8, ps ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ps_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_4xN 4, 16, ps -FILTER_VER_LUMA_AVX2_4xN 4, 16, ps + FILTER_VER_LUMA_4xN 4, 16, ps + FILTER_VER_LUMA_AVX2_4xN 
4, 16, ps %macro PROCESS_LUMA_AVX2_W8_8R 0 movq xm1, [r0] ; m1 = row 0 @@ -8895,50 +12078,50 @@ ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_8xN 8, 4, pp -FILTER_VER_LUMA_AVX2_8x4 pp + FILTER_VER_LUMA_8xN 8, 4, pp + FILTER_VER_LUMA_AVX2_8x4 pp ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_8x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_8xN 8, 8, pp -FILTER_VER_LUMA_AVX2_8x8 pp + FILTER_VER_LUMA_8xN 8, 8, pp + FILTER_VER_LUMA_AVX2_8x8 pp ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_8x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_8xN 8, 16, pp -FILTER_VER_LUMA_AVX2_8xN 8, 16, pp + FILTER_VER_LUMA_8xN 8, 16, pp + FILTER_VER_LUMA_AVX2_8xN 8, 16, pp ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_8x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_8xN 8, 32, pp -FILTER_VER_LUMA_AVX2_8xN 8, 32, pp + FILTER_VER_LUMA_8xN 8, 32, pp + FILTER_VER_LUMA_AVX2_8xN 8, 32, pp ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ps_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_8xN 8, 4, ps -FILTER_VER_LUMA_AVX2_8x4 ps + FILTER_VER_LUMA_8xN 8, 4, ps + FILTER_VER_LUMA_AVX2_8x4 ps ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ps_8x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_8xN 8, 8, ps -FILTER_VER_LUMA_AVX2_8x8 ps + FILTER_VER_LUMA_8xN 8, 8, ps + FILTER_VER_LUMA_AVX2_8x8 ps ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ps_8x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_8xN 8, 16, ps -FILTER_VER_LUMA_AVX2_8xN 8, 16, ps + FILTER_VER_LUMA_8xN 8, 16, ps + FILTER_VER_LUMA_AVX2_8xN 8, 16, ps ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ps_8x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_8xN 8, 32, ps -FILTER_VER_LUMA_AVX2_8xN 8, 32, ps + FILTER_VER_LUMA_8xN 8, 32, ps + FILTER_VER_LUMA_AVX2_8xN 8, 32, ps ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_%3_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -9000,7 +12183,7 @@ lea r5, [8 * r1 - 8] sub r0, r5 -%ifidn %3,pp +%ifidn %3,pp add r2, 8 %else add r2, 16 @@ -9047,12 +12230,12 @@ ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_12xN 12, 16, pp + FILTER_VER_LUMA_12xN 12, 16, pp ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ps_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- -FILTER_VER_LUMA_12xN 12, 16, ps + FILTER_VER_LUMA_12xN 12, 16, ps %macro FILTER_VER_LUMA_AVX2_12x16 1 INIT_YMM avx2 @@ -9443,8 +12626,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_12x16 pp -FILTER_VER_LUMA_AVX2_12x16 ps + FILTER_VER_LUMA_AVX2_12x16 pp + FILTER_VER_LUMA_AVX2_12x16 ps %macro FILTER_VER_LUMA_AVX2_16x16 1 INIT_YMM avx2 @@ -9787,8 +12970,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_16x16 pp -FILTER_VER_LUMA_AVX2_16x16 ps + FILTER_VER_LUMA_AVX2_16x16 pp + FILTER_VER_LUMA_AVX2_16x16 ps %macro FILTER_VER_LUMA_AVX2_16x12 1 INIT_YMM avx2 @@ -10062,8 +13245,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_16x12 pp -FILTER_VER_LUMA_AVX2_16x12 ps + FILTER_VER_LUMA_AVX2_16x12 pp + FILTER_VER_LUMA_AVX2_16x12 ps %macro FILTER_VER_LUMA_AVX2_16x8 1 INIT_YMM avx2 @@ -10258,8 +13441,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_16x8 pp -FILTER_VER_LUMA_AVX2_16x8 ps + FILTER_VER_LUMA_AVX2_16x8 pp + FILTER_VER_LUMA_AVX2_16x8 ps %macro FILTER_VER_LUMA_AVX2_16x4 1 INIT_YMM avx2 @@ -10383,8 +13566,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_16x4 pp -FILTER_VER_LUMA_AVX2_16x4 ps + FILTER_VER_LUMA_AVX2_16x4 pp + FILTER_VER_LUMA_AVX2_16x4 ps %macro FILTER_VER_LUMA_AVX2_16xN 3 INIT_YMM avx2 %if ARCH_X86_64 == 1 @@ -10735,10 +13918,10 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_16xN 16, 32, pp -FILTER_VER_LUMA_AVX2_16xN 16, 64, pp -FILTER_VER_LUMA_AVX2_16xN 16, 32, ps -FILTER_VER_LUMA_AVX2_16xN 16, 64, ps + FILTER_VER_LUMA_AVX2_16xN 16, 32, pp + FILTER_VER_LUMA_AVX2_16xN 16, 64, pp + FILTER_VER_LUMA_AVX2_16xN 16, 32, ps + FILTER_VER_LUMA_AVX2_16xN 16, 64, ps %macro PROCESS_LUMA_AVX2_W16_16R 1 movu xm0, [r0] ; m0 = row 0 @@ -11466,8 +14649,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_24x32 pp -FILTER_VER_LUMA_AVX2_24x32 ps + FILTER_VER_LUMA_AVX2_24x32 pp + FILTER_VER_LUMA_AVX2_24x32 ps %macro FILTER_VER_LUMA_AVX2_32xN 3 INIT_YMM avx2 @@ -11517,10 +14700,10 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_32xN 32, 32, pp -FILTER_VER_LUMA_AVX2_32xN 32, 64, pp -FILTER_VER_LUMA_AVX2_32xN 32, 32, ps -FILTER_VER_LUMA_AVX2_32xN 32, 64, ps + FILTER_VER_LUMA_AVX2_32xN 32, 32, pp + FILTER_VER_LUMA_AVX2_32xN 32, 64, pp + FILTER_VER_LUMA_AVX2_32xN 32, 32, ps + FILTER_VER_LUMA_AVX2_32xN 32, 64, ps %macro 
FILTER_VER_LUMA_AVX2_32x16 1 INIT_YMM avx2 @@ -11560,9 +14743,9 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_32x16 pp -FILTER_VER_LUMA_AVX2_32x16 ps - + FILTER_VER_LUMA_AVX2_32x16 pp + FILTER_VER_LUMA_AVX2_32x16 ps + %macro FILTER_VER_LUMA_AVX2_32x24 1 INIT_YMM avx2 %if ARCH_X86_64 == 1 @@ -11620,8 +14803,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_32x24 pp -FILTER_VER_LUMA_AVX2_32x24 ps + FILTER_VER_LUMA_AVX2_32x24 pp + FILTER_VER_LUMA_AVX2_32x24 ps %macro FILTER_VER_LUMA_AVX2_32x8 1 INIT_YMM avx2 @@ -11663,8 +14846,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_32x8 pp -FILTER_VER_LUMA_AVX2_32x8 ps + FILTER_VER_LUMA_AVX2_32x8 pp + FILTER_VER_LUMA_AVX2_32x8 ps %macro FILTER_VER_LUMA_AVX2_48x64 1 INIT_YMM avx2 @@ -11722,8 +14905,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_48x64 pp -FILTER_VER_LUMA_AVX2_48x64 ps + FILTER_VER_LUMA_AVX2_48x64 pp + FILTER_VER_LUMA_AVX2_48x64 ps %macro FILTER_VER_LUMA_AVX2_64xN 3 INIT_YMM avx2 @@ -11781,12 +14964,12 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_64xN 64, 32, pp -FILTER_VER_LUMA_AVX2_64xN 64, 48, pp -FILTER_VER_LUMA_AVX2_64xN 64, 64, pp -FILTER_VER_LUMA_AVX2_64xN 64, 32, ps -FILTER_VER_LUMA_AVX2_64xN 64, 48, ps -FILTER_VER_LUMA_AVX2_64xN 64, 64, ps + FILTER_VER_LUMA_AVX2_64xN 64, 32, pp + FILTER_VER_LUMA_AVX2_64xN 64, 48, pp + FILTER_VER_LUMA_AVX2_64xN 64, 64, pp + FILTER_VER_LUMA_AVX2_64xN 64, 32, ps + FILTER_VER_LUMA_AVX2_64xN 64, 48, ps + FILTER_VER_LUMA_AVX2_64xN 64, 64, ps %macro FILTER_VER_LUMA_AVX2_64x16 1 INIT_YMM avx2 @@ -11832,8 +15015,8 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_64x16 pp -FILTER_VER_LUMA_AVX2_64x16 ps + FILTER_VER_LUMA_AVX2_64x16 pp + FILTER_VER_LUMA_AVX2_64x16 ps ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -11916,41 +15099,41 @@ RET %endmacro -FILTER_VER_LUMA 16, 4, pp -FILTER_VER_LUMA 16, 8, pp -FILTER_VER_LUMA 16, 12, pp -FILTER_VER_LUMA 16, 16, pp -FILTER_VER_LUMA 16, 32, pp -FILTER_VER_LUMA 16, 64, pp -FILTER_VER_LUMA 24, 32, pp -FILTER_VER_LUMA 32, 8, pp -FILTER_VER_LUMA 32, 16, pp -FILTER_VER_LUMA 32, 24, pp -FILTER_VER_LUMA 32, 32, pp -FILTER_VER_LUMA 32, 64, pp -FILTER_VER_LUMA 48, 64, pp -FILTER_VER_LUMA 64, 16, pp -FILTER_VER_LUMA 64, 32, pp -FILTER_VER_LUMA 64, 48, pp -FILTER_VER_LUMA 64, 64, pp - -FILTER_VER_LUMA 16, 4, ps -FILTER_VER_LUMA 16, 8, ps -FILTER_VER_LUMA 16, 12, ps -FILTER_VER_LUMA 16, 16, ps -FILTER_VER_LUMA 16, 32, ps -FILTER_VER_LUMA 16, 64, ps -FILTER_VER_LUMA 24, 32, ps -FILTER_VER_LUMA 32, 8, ps -FILTER_VER_LUMA 32, 16, ps -FILTER_VER_LUMA 32, 24, ps -FILTER_VER_LUMA 32, 32, ps -FILTER_VER_LUMA 32, 64, ps -FILTER_VER_LUMA 48, 64, ps -FILTER_VER_LUMA 64, 16, ps -FILTER_VER_LUMA 64, 32, ps -FILTER_VER_LUMA 64, 48, ps -FILTER_VER_LUMA 64, 64, ps + FILTER_VER_LUMA 16, 4, pp + FILTER_VER_LUMA 16, 8, pp + FILTER_VER_LUMA 16, 12, pp + FILTER_VER_LUMA 16, 16, pp + FILTER_VER_LUMA 16, 32, pp + FILTER_VER_LUMA 16, 64, pp + FILTER_VER_LUMA 24, 32, pp + FILTER_VER_LUMA 32, 8, pp + FILTER_VER_LUMA 32, 16, pp + FILTER_VER_LUMA 32, 24, pp + FILTER_VER_LUMA 32, 32, pp + FILTER_VER_LUMA 32, 64, pp + FILTER_VER_LUMA 48, 64, pp + FILTER_VER_LUMA 64, 16, pp + FILTER_VER_LUMA 64, 32, pp + FILTER_VER_LUMA 64, 48, pp + FILTER_VER_LUMA 64, 64, pp + + FILTER_VER_LUMA 16, 4, ps + FILTER_VER_LUMA 16, 8, ps + FILTER_VER_LUMA 16, 12, ps + FILTER_VER_LUMA 16, 16, ps + FILTER_VER_LUMA 16, 32, ps + FILTER_VER_LUMA 16, 64, ps + FILTER_VER_LUMA 24, 32, ps 
+ FILTER_VER_LUMA 32, 8, ps + FILTER_VER_LUMA 32, 16, ps + FILTER_VER_LUMA 32, 24, ps + FILTER_VER_LUMA 32, 32, ps + FILTER_VER_LUMA 32, 64, ps + FILTER_VER_LUMA 48, 64, ps + FILTER_VER_LUMA 64, 16, ps + FILTER_VER_LUMA 64, 32, ps + FILTER_VER_LUMA 64, 48, ps + FILTER_VER_LUMA 64, 64, ps %macro PROCESS_LUMA_SP_W4_4R 0 movq m0, [r0] @@ -12036,7 +15219,7 @@ lea r6, [tab_LumaCoeffV + r4] %endif - mova m7, [tab_c_526336] + mova m7, [pd_526336] mov dword [rsp], %2/4 .loopH: @@ -12110,63 +15293,49 @@ FILTER_VER_LUMA_SP 64, 16 FILTER_VER_LUMA_SP 16, 64 -; TODO: combin of U and V is more performance, but need more register -; TODO: use two path for height alignment to 4 and otherwise may improvement 10% performance, but code is more complex, so I disable it -INIT_XMM ssse3 -cglobal chroma_p2s, 3, 7, 4 - - ; load width and height +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal filterPixelToShort_4x2, 3, 4, 3 mov r3d, r3m - mov r4d, r4m + add r3d, r3d ; load constant - mova m2, [pb_128] - mova m3, [tab_c_64_n64] + mova m1, [pb_128] + mova m2, [tab_c_64_n64] -.loopH: - - xor r5d, r5d -.loopW: - lea r6, [r0 + r5] + movd m0, [r0] + pinsrd m0, [r0 + r1], 1 + punpcklbw m0, m1 + pmaddubsw m0, m2 - movh m0, [r6] - punpcklbw m0, m2 - pmaddubsw m0, m3 + movq [r2 + r3 * 0], m0 + movhps [r2 + r3 * 1], m0 - movh m1, [r6 + r1] - punpcklbw m1, m2 - pmaddubsw m1, m3 + RET - add r5d, 8 - cmp r5d, r3d - lea r6, [r2 + r5 * 2] - jg .width4 - movu [r6 + FENC_STRIDE / 2 * 0 - 16], m0 - movu [r6 + FENC_STRIDE / 2 * 2 - 16], m1 - je .nextH - jmp .loopW - -.width4: - test r3d, 4 - jz .width2 - test r3d, 2 - movh [r6 + FENC_STRIDE / 2 * 0 - 16], m0 - movh [r6 + FENC_STRIDE / 2 * 2 - 16], m1 - lea r6, [r6 + 8] - pshufd m0, m0, 2 - pshufd m1, m1, 2 - jz .nextH - -.width2: - movd [r6 + FENC_STRIDE / 2 * 0 - 16], m0 - movd [r6 + FENC_STRIDE / 2 * 2 - 16], m1 +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +INIT_XMM ssse3 +cglobal filterPixelToShort_8x2, 3, 4, 3 + mov r3d, r3m + add r3d, r3d -.nextH: - lea r0, [r0 + r1 * 2] - add r2, FENC_STRIDE / 2 * 4 + ; load constant + mova m1, [pb_128] + mova m2, [tab_c_64_n64] - sub r4d, 2 - jnz .loopH + movh m0, [r0] + punpcklbw m0, m1 + pmaddubsw m0, m2 + movu [r2 + r3 * 0], m0 + + movh m0, [r0 + r1] + punpcklbw m0, m1 + pmaddubsw m0, m2 + movu [r2 + r3 * 1], m0 RET @@ -12223,7 +15392,7 @@ lea r6, [tab_ChromaCoeffV + r4] %endif - mova m6, [tab_c_526336] + mova m6, [pd_526336] mov dword [rsp], %2/4 @@ -12350,7 +15519,7 @@ lea r5, [tab_ChromaCoeffV + r4] %endif - mova m5, [tab_c_526336] + mova m5, [pd_526336] mov r4d, (%2/4) @@ -12380,10 +15549,10 @@ RET %endmacro -FILTER_VER_CHROMA_SP_W2_4R 2, 4 -FILTER_VER_CHROMA_SP_W2_4R 2, 8 + FILTER_VER_CHROMA_SP_W2_4R 2, 4 + FILTER_VER_CHROMA_SP_W2_4R 2, 8 -FILTER_VER_CHROMA_SP_W2_4R 2, 16 + FILTER_VER_CHROMA_SP_W2_4R 2, 16 ;-------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_sp_4x2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -12402,7 +15571,7 @@ lea r5, [tab_ChromaCoeffV + r4] %endif - mova m4, 
[tab_c_526336] + mova m4, [pd_526336] movq m0, [r0] movq m1, [r0 + r1] @@ -12454,7 +15623,7 @@ lea r6, [tab_ChromaCoeffV + r4] %endif - mova m6, [tab_c_526336] + mova m6, [pd_526336] mov r4d, %2/4 @@ -12512,9 +15681,9 @@ RET %endmacro -FILTER_VER_CHROMA_SP_W6_H4 6, 8 + FILTER_VER_CHROMA_SP_W6_H4 6, 8 -FILTER_VER_CHROMA_SP_W6_H4 6, 16 + FILTER_VER_CHROMA_SP_W6_H4 6, 16 %macro PROCESS_CHROMA_SP_W8_2R 0 movu m1, [r0] @@ -12566,7 +15735,7 @@ lea r5, [tab_ChromaCoeffV + r4] %endif - mova m7, [tab_c_526336] + mova m7, [pd_526336] mov r4d, %2/2 .loopH: @@ -12598,15 +15767,15 @@ RET %endmacro -FILTER_VER_CHROMA_SP_W8_H2 8, 2 -FILTER_VER_CHROMA_SP_W8_H2 8, 4 -FILTER_VER_CHROMA_SP_W8_H2 8, 6 -FILTER_VER_CHROMA_SP_W8_H2 8, 8 -FILTER_VER_CHROMA_SP_W8_H2 8, 16 -FILTER_VER_CHROMA_SP_W8_H2 8, 32 + FILTER_VER_CHROMA_SP_W8_H2 8, 2 + FILTER_VER_CHROMA_SP_W8_H2 8, 4 + FILTER_VER_CHROMA_SP_W8_H2 8, 6 + FILTER_VER_CHROMA_SP_W8_H2 8, 8 + FILTER_VER_CHROMA_SP_W8_H2 8, 16 + FILTER_VER_CHROMA_SP_W8_H2 8, 32 -FILTER_VER_CHROMA_SP_W8_H2 8, 12 -FILTER_VER_CHROMA_SP_W8_H2 8, 64 + FILTER_VER_CHROMA_SP_W8_H2 8, 12 + FILTER_VER_CHROMA_SP_W8_H2 8, 64 ;----------------------------------------------------------------------------------------------------------------------------- @@ -12658,10 +15827,10 @@ RET %endmacro -FILTER_HORIZ_CHROMA_2xN 2, 4 -FILTER_HORIZ_CHROMA_2xN 2, 8 + FILTER_HORIZ_CHROMA_2xN 2, 4 + FILTER_HORIZ_CHROMA_2xN 2, 8 -FILTER_HORIZ_CHROMA_2xN 2, 16 + FILTER_HORIZ_CHROMA_2xN 2, 16 ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_4x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) @@ -12711,12 +15880,12 @@ RET %endmacro -FILTER_HORIZ_CHROMA_4xN 4, 2 -FILTER_HORIZ_CHROMA_4xN 4, 4 -FILTER_HORIZ_CHROMA_4xN 4, 8 -FILTER_HORIZ_CHROMA_4xN 4, 16 + FILTER_HORIZ_CHROMA_4xN 4, 2 + FILTER_HORIZ_CHROMA_4xN 4, 4 + FILTER_HORIZ_CHROMA_4xN 4, 8 + FILTER_HORIZ_CHROMA_4xN 4, 16 -FILTER_HORIZ_CHROMA_4xN 4, 32 + FILTER_HORIZ_CHROMA_4xN 4, 32 %macro PROCESS_CHROMA_W6 3 movu %1, [srcq] @@ -12794,11 +15963,11 @@ RET %endmacro -FILTER_HORIZ_CHROMA 6, 8 -FILTER_HORIZ_CHROMA 12, 16 + FILTER_HORIZ_CHROMA 6, 8 + FILTER_HORIZ_CHROMA 12, 16 -FILTER_HORIZ_CHROMA 6, 16 -FILTER_HORIZ_CHROMA 12, 32 + FILTER_HORIZ_CHROMA 6, 16 + FILTER_HORIZ_CHROMA 12, 32 %macro PROCESS_CHROMA_W8 3 movu %1, [srcq] @@ -12857,15 +16026,15 @@ RET %endmacro -FILTER_HORIZ_CHROMA_8xN 8, 2 -FILTER_HORIZ_CHROMA_8xN 8, 4 -FILTER_HORIZ_CHROMA_8xN 8, 6 -FILTER_HORIZ_CHROMA_8xN 8, 8 -FILTER_HORIZ_CHROMA_8xN 8, 16 -FILTER_HORIZ_CHROMA_8xN 8, 32 + FILTER_HORIZ_CHROMA_8xN 8, 2 + FILTER_HORIZ_CHROMA_8xN 8, 4 + FILTER_HORIZ_CHROMA_8xN 8, 6 + FILTER_HORIZ_CHROMA_8xN 8, 8 + FILTER_HORIZ_CHROMA_8xN 8, 16 + FILTER_HORIZ_CHROMA_8xN 8, 32 -FILTER_HORIZ_CHROMA_8xN 8, 12 -FILTER_HORIZ_CHROMA_8xN 8, 64 + FILTER_HORIZ_CHROMA_8xN 8, 12 + FILTER_HORIZ_CHROMA_8xN 8, 64 %macro PROCESS_CHROMA_W16 4 movu %1, [srcq] @@ -13027,28 +16196,28 @@ RET %endmacro -FILTER_HORIZ_CHROMA_WxN 16, 4 -FILTER_HORIZ_CHROMA_WxN 16, 8 -FILTER_HORIZ_CHROMA_WxN 16, 12 -FILTER_HORIZ_CHROMA_WxN 16, 16 -FILTER_HORIZ_CHROMA_WxN 16, 32 -FILTER_HORIZ_CHROMA_WxN 24, 32 -FILTER_HORIZ_CHROMA_WxN 32, 8 -FILTER_HORIZ_CHROMA_WxN 32, 16 -FILTER_HORIZ_CHROMA_WxN 32, 24 -FILTER_HORIZ_CHROMA_WxN 32, 32 - -FILTER_HORIZ_CHROMA_WxN 16, 24 -FILTER_HORIZ_CHROMA_WxN 16, 64 -FILTER_HORIZ_CHROMA_WxN 24, 64 -FILTER_HORIZ_CHROMA_WxN 32, 48 -FILTER_HORIZ_CHROMA_WxN 32, 64 - 
-FILTER_HORIZ_CHROMA_WxN 64, 64 -FILTER_HORIZ_CHROMA_WxN 64, 32 -FILTER_HORIZ_CHROMA_WxN 64, 48 -FILTER_HORIZ_CHROMA_WxN 48, 64 -FILTER_HORIZ_CHROMA_WxN 64, 16 + FILTER_HORIZ_CHROMA_WxN 16, 4 + FILTER_HORIZ_CHROMA_WxN 16, 8 + FILTER_HORIZ_CHROMA_WxN 16, 12 + FILTER_HORIZ_CHROMA_WxN 16, 16 + FILTER_HORIZ_CHROMA_WxN 16, 32 + FILTER_HORIZ_CHROMA_WxN 24, 32 + FILTER_HORIZ_CHROMA_WxN 32, 8 + FILTER_HORIZ_CHROMA_WxN 32, 16 + FILTER_HORIZ_CHROMA_WxN 32, 24 + FILTER_HORIZ_CHROMA_WxN 32, 32 + + FILTER_HORIZ_CHROMA_WxN 16, 24 + FILTER_HORIZ_CHROMA_WxN 16, 64 + FILTER_HORIZ_CHROMA_WxN 24, 64 + FILTER_HORIZ_CHROMA_WxN 32, 48 + FILTER_HORIZ_CHROMA_WxN 32, 64 + + FILTER_HORIZ_CHROMA_WxN 64, 64 + FILTER_HORIZ_CHROMA_WxN 64, 32 + FILTER_HORIZ_CHROMA_WxN 64, 48 + FILTER_HORIZ_CHROMA_WxN 48, 64 + FILTER_HORIZ_CHROMA_WxN 64, 16 ;--------------------------------------------------------------------------------------------------------------- @@ -13144,11 +16313,11 @@ RET %endmacro -FILTER_V_PS_W16n 64, 64 -FILTER_V_PS_W16n 64, 32 -FILTER_V_PS_W16n 64, 48 -FILTER_V_PS_W16n 48, 64 -FILTER_V_PS_W16n 64, 16 + FILTER_V_PS_W16n 64, 64 + FILTER_V_PS_W16n 64, 32 + FILTER_V_PS_W16n 64, 48 + FILTER_V_PS_W16n 48, 64 + FILTER_V_PS_W16n 64, 16 ;------------------------------------------------------------------------------------------------------------ @@ -13306,12 +16475,12 @@ dec r4d jnz .loop -RET + RET %endmacro -FILTER_V_PS_W2 2, 8 + FILTER_V_PS_W2 2, 8 -FILTER_V_PS_W2 2, 16 + FILTER_V_PS_W2 2, 16 ;----------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -13472,8 +16641,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_4x4 sp -FILTER_VER_CHROMA_S_AVX2_4x4 ss + FILTER_VER_CHROMA_S_AVX2_4x4 sp + FILTER_VER_CHROMA_S_AVX2_4x4 ss %macro FILTER_VER_CHROMA_S_AVX2_4x8 1 INIT_YMM avx2 @@ -13584,8 +16753,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_4x8 sp -FILTER_VER_CHROMA_S_AVX2_4x8 ss + FILTER_VER_CHROMA_S_AVX2_4x8 sp + FILTER_VER_CHROMA_S_AVX2_4x8 ss %macro PROCESS_CHROMA_AVX2_W4_16R 1 movq xm0, [r0] @@ -13779,8 +16948,40 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_4x16 sp -FILTER_VER_CHROMA_S_AVX2_4x16 ss + FILTER_VER_CHROMA_S_AVX2_4x16 sp + FILTER_VER_CHROMA_S_AVX2_4x16 ss + +%macro FILTER_VER_CHROMA_S_AVX2_4x32 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x32, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m7, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] +%rep 2 + PROCESS_CHROMA_AVX2_W4_16R %1 + lea r2, [r2 + r3 * 4] +%endrep + RET +%endmacro + + FILTER_VER_CHROMA_S_AVX2_4x32 sp + FILTER_VER_CHROMA_S_AVX2_4x32 ss %macro FILTER_VER_CHROMA_S_AVX2_4x2 1 INIT_YMM avx2 @@ -13836,8 +17037,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_4x2 sp -FILTER_VER_CHROMA_S_AVX2_4x2 ss + FILTER_VER_CHROMA_S_AVX2_4x2 sp + FILTER_VER_CHROMA_S_AVX2_4x2 ss %macro FILTER_VER_CHROMA_S_AVX2_2x4 1 INIT_YMM avx2 @@ -13906,8 +17107,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_2x4 sp -FILTER_VER_CHROMA_S_AVX2_2x4 ss + FILTER_VER_CHROMA_S_AVX2_2x4 sp + FILTER_VER_CHROMA_S_AVX2_2x4 ss %macro FILTER_VER_CHROMA_S_AVX2_8x8 1 INIT_YMM avx2 @@ -14085,8 +17286,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_8x8 sp -FILTER_VER_CHROMA_S_AVX2_8x8 ss + FILTER_VER_CHROMA_S_AVX2_8x8 sp + FILTER_VER_CHROMA_S_AVX2_8x8 ss %macro 
PROCESS_CHROMA_S_AVX2_W8_16R 1 movu xm0, [r0] ; m0 = row 0 @@ -14401,10 +17602,12 @@ %endif %endmacro -FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16 -FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32 -FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16 -FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32 + FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16 + FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32 + FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 64 + FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16 + FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32 + FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 64 %macro FILTER_VER_CHROMA_S_AVX2_NxN 3 INIT_YMM avx2 @@ -14453,12 +17656,28 @@ %endif %endmacro -FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp -FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp -FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp -FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss -FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss -FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp + FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp + FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp + FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, sp + FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, ss + FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, ss + FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, ss + FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, ss + FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, sp + FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, sp + FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, sp + FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, ss + FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, ss + FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, ss + FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, ss %macro PROCESS_CHROMA_S_AVX2_W8_4R 1 movu xm0, [r0] ; m0 = row 0 @@ -14567,8 +17786,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_8x4 sp -FILTER_VER_CHROMA_S_AVX2_8x4 ss + FILTER_VER_CHROMA_S_AVX2_8x4 sp + FILTER_VER_CHROMA_S_AVX2_8x4 ss %macro FILTER_VER_CHROMA_S_AVX2_12x16 1 INIT_YMM avx2 @@ -14606,8 +17825,55 @@ %endif %endmacro -FILTER_VER_CHROMA_S_AVX2_12x16 sp -FILTER_VER_CHROMA_S_AVX2_12x16 ss + FILTER_VER_CHROMA_S_AVX2_12x16 sp + FILTER_VER_CHROMA_S_AVX2_12x16 ss + +%macro FILTER_VER_CHROMA_S_AVX2_12x32 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_12x32, 4, 9, 10 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1, sp + mova m9, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] +%rep 2 + PROCESS_CHROMA_S_AVX2_W8_16R %1 +%ifidn %1, sp + add r2, 8 +%else + add r2, 16 +%endif + add r0, 16 + mova m7, m9 + PROCESS_CHROMA_AVX2_W4_16R %1 + sub r0, 16 +%ifidn %1, sp + lea r2, [r2 + r3 * 4 - 8] +%else + lea r2, [r2 + r3 * 4 - 16] +%endif +%endrep + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_12x32 sp + FILTER_VER_CHROMA_S_AVX2_12x32 ss %macro FILTER_VER_CHROMA_S_AVX2_16x12 1 INIT_YMM avx2 @@ -14860,8 +18126,257 @@ %endif %endmacro -FILTER_VER_CHROMA_S_AVX2_16x12 sp -FILTER_VER_CHROMA_S_AVX2_16x12 ss + FILTER_VER_CHROMA_S_AVX2_16x12 sp + FILTER_VER_CHROMA_S_AVX2_16x12 ss + +%macro FILTER_VER_CHROMA_S_AVX2_8x12 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x12, 4, 7, 9 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m8, 
[pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m1, m8 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m8 + paddd m3, m8 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + mova m3, [interp8_hps_shuf] + vpermd m0, m3, m0 + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 +%else + vpermq m0, m0, 11011000b + vpermq m2, m2, 11011000b + movu [r2], xm0 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2 + r3], xm0 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm3 +%endif + lea r2, [r2 + r3 * 4] + + movu xm1, [r0 + r4] ; m1 = row 7 + punpckhwd xm0, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm0, 1 + pmaddwd m0, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m0 + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m8 + paddd m5, m8 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r0 + r1] ; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd m5, m0, [r5 + 1 * mmsize] + paddd m6, m5 + pmaddwd m0, [r5] + movu xm5, [r0 + r1 * 2] ; m5 = row 10 + punpckhwd xm7, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm7, 1 + pmaddwd m7, m2, [r5 + 1 * mmsize] + paddd m1, m7 + pmaddwd m2, [r5] + +%ifidn %1,sp + paddd m6, m8 + paddd m1, m8 + psrad m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vpermd m4, m3, m4 + vextracti128 xm6, m4, 1 + movq [r2], xm4 + movhps [r2 + r3], xm4 + movq [r2 + r3 * 2], xm6 + movhps [r2 + r6], xm6 +%else + vpermq m4, m4, 11011000b + vpermq m6, m6, 11011000b + vextracti128 xm7, m4, 1 + vextracti128 xm1, m6, 1 + movu [r2], xm4 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm1 +%endif + lea r2, [r2 + r3 * 4] + + movu xm7, [r0 + r4] ; m7 = row 11 + punpckhwd xm1, xm5, xm7 + punpcklwd xm5, xm7 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + paddd m0, m1 + pmaddwd m5, [r5] + lea r0, [r0 + r1 * 4] + movu xm1, [r0] ; m1 = row 12 + punpckhwd xm4, xm7, xm1 + punpcklwd xm7, xm1 + vinserti128 m7, m7, xm4, 1 + pmaddwd 
m4, m7, [r5 + 1 * mmsize] + paddd m2, m4 + pmaddwd m7, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m2, m8 + psrad m0, 12 + psrad m2, 12 +%else + psrad m0, 6 + psrad m2, 6 +%endif + packssdw m0, m2 + + movu xm4, [r0 + r1] ; m4 = row 13 + punpckhwd xm2, xm1, xm4 + punpcklwd xm1, xm4 + vinserti128 m1, m1, xm2, 1 + pmaddwd m1, [r5 + 1 * mmsize] + paddd m5, m1 + movu xm2, [r0 + r1 * 2] ; m2 = row 14 + punpckhwd xm6, xm4, xm2 + punpcklwd xm4, xm2 + vinserti128 m4, m4, xm6, 1 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m7, m4 +%ifidn %1,sp + paddd m5, m8 + paddd m7, m8 + psrad m5, 12 + psrad m7, 12 +%else + psrad m5, 6 + psrad m7, 6 +%endif + packssdw m5, m7 +%ifidn %1,sp + packuswb m0, m5 + vpermd m0, m3, m0 + vextracti128 xm5, m0, 1 + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm5 +%else + vpermq m0, m0, 11011000b + vpermq m5, m5, 11011000b + vextracti128 xm7, m0, 1 + vextracti128 xm6, m5, 1 + movu [r2], xm0 + movu [r2 + r3], xm7 + movu [r2 + r3 * 2], xm5 + movu [r2 + r6], xm6 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_8x12 sp + FILTER_VER_CHROMA_S_AVX2_8x12 ss %macro FILTER_VER_CHROMA_S_AVX2_16x4 1 INIT_YMM avx2 @@ -14906,8 +18421,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_16x4 sp -FILTER_VER_CHROMA_S_AVX2_16x4 ss + FILTER_VER_CHROMA_S_AVX2_16x4 sp + FILTER_VER_CHROMA_S_AVX2_16x4 ss %macro PROCESS_CHROMA_S_AVX2_W8_8R 1 movu xm0, [r0] ; m0 = row 0 @@ -15097,10 +18612,10 @@ %endif %endmacro -FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32 -FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16 -FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32 -FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16 + FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32 + FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16 + FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32 + FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16 %macro FILTER_VER_CHROMA_S_AVX2_8x2 1 INIT_YMM avx2 @@ -15172,8 +18687,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_8x2 sp -FILTER_VER_CHROMA_S_AVX2_8x2 ss + FILTER_VER_CHROMA_S_AVX2_8x2 sp + FILTER_VER_CHROMA_S_AVX2_8x2 ss %macro FILTER_VER_CHROMA_S_AVX2_8x6 1 INIT_YMM avx2 @@ -15315,8 +18830,8 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_8x6 sp -FILTER_VER_CHROMA_S_AVX2_8x6 ss + FILTER_VER_CHROMA_S_AVX2_8x6 sp + FILTER_VER_CHROMA_S_AVX2_8x6 ss %macro FILTER_VER_CHROMA_S_AVX2_8xN 2 INIT_YMM avx2 @@ -15637,15 +19152,17 @@ %endif %endmacro -FILTER_VER_CHROMA_S_AVX2_8xN sp, 16 -FILTER_VER_CHROMA_S_AVX2_8xN sp, 32 -FILTER_VER_CHROMA_S_AVX2_8xN ss, 16 -FILTER_VER_CHROMA_S_AVX2_8xN ss, 32 + FILTER_VER_CHROMA_S_AVX2_8xN sp, 16 + FILTER_VER_CHROMA_S_AVX2_8xN sp, 32 + FILTER_VER_CHROMA_S_AVX2_8xN sp, 64 + FILTER_VER_CHROMA_S_AVX2_8xN ss, 16 + FILTER_VER_CHROMA_S_AVX2_8xN ss, 32 + FILTER_VER_CHROMA_S_AVX2_8xN ss, 64 -%macro FILTER_VER_CHROMA_S_AVX2_32x24 1 -INIT_YMM avx2 +%macro FILTER_VER_CHROMA_S_AVX2_Nx24 2 %if ARCH_X86_64 == 1 -cglobal interp_4tap_vert_%1_32x24, 4, 10, 10 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_%2x24, 4, 10, 10 mov r4d, r4m shl r4d, 6 add r1d, r1d @@ -15665,7 +19182,7 @@ add r3d, r3d %endif lea r6, [r3 * 3] - mov r9d, 4 + mov r9d, %2 / 8 .loopW: PROCESS_CHROMA_S_AVX2_W8_16R %1 %ifidn %1,sp @@ -15677,13 +19194,13 @@ dec r9d jnz .loopW %ifidn %1,sp - lea r2, [r8 + r3 * 4 - 24] + lea r2, [r8 + r3 * 4 - %2 + 8] %else - lea r2, [r8 + r3 * 4 - 48] + lea r2, [r8 + r3 * 4 - 2 * %2 + 16] %endif - lea r0, [r7 - 48] + lea r0, [r7 - 2 * %2 + 16] mova m7, m9 - mov r9d, 4 + mov r9d, %2 / 8 .loop: PROCESS_CHROMA_S_AVX2_W8_8R %1 %ifidn %1,sp @@ -15698,8 +19215,10 @@ %endif %endmacro -FILTER_VER_CHROMA_S_AVX2_32x24 sp -FILTER_VER_CHROMA_S_AVX2_32x24 ss + 
FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 32 + FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 16 + FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 32 + FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 16 %macro FILTER_VER_CHROMA_S_AVX2_2x8 1 INIT_YMM avx2 @@ -15797,8 +19316,170 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_2x8 sp -FILTER_VER_CHROMA_S_AVX2_2x8 ss + FILTER_VER_CHROMA_S_AVX2_2x8 sp + FILTER_VER_CHROMA_S_AVX2_2x8 ss + +%macro FILTER_VER_CHROMA_S_AVX2_2x16 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_2x16, 4, 6, 9 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + sub r0, r1 + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] +%ifidn %1,sp + mova m6, [pd_526336] +%else + add r3d, r3d +%endif + movd xm0, [r0] + movd xm1, [r0 + r1] + punpcklwd xm0, xm1 + movd xm2, [r0 + r1 * 2] + punpcklwd xm1, xm2 + punpcklqdq xm0, xm1 ; m0 = [2 1 1 0] + movd xm3, [r0 + r4] + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movd xm4, [r0] + punpcklwd xm3, xm4 + punpcklqdq xm2, xm3 ; m2 = [4 3 3 2] + vinserti128 m0, m0, xm2, 1 ; m0 = [4 3 3 2 2 1 1 0] + movd xm1, [r0 + r1] + punpcklwd xm4, xm1 + movd xm3, [r0 + r1 * 2] + punpcklwd xm1, xm3 + punpcklqdq xm4, xm1 ; m4 = [6 5 5 4] + vinserti128 m2, m2, xm4, 1 ; m2 = [6 5 5 4 4 3 3 2] + pmaddwd m0, [r5] + pmaddwd m2, [r5 + 1 * mmsize] + paddd m0, m2 + movd xm1, [r0 + r4] + punpcklwd xm3, xm1 + lea r0, [r0 + 4 * r1] + movd xm2, [r0] + punpcklwd xm1, xm2 + punpcklqdq xm3, xm1 ; m3 = [8 7 7 6] + vinserti128 m4, m4, xm3, 1 ; m4 = [8 7 7 6 6 5 5 4] + movd xm1, [r0 + r1] + punpcklwd xm2, xm1 + movd xm5, [r0 + r1 * 2] + punpcklwd xm1, xm5 + punpcklqdq xm2, xm1 ; m2 = [10 9 9 8] + vinserti128 m3, m3, xm2, 1 ; m3 = [10 9 9 8 8 7 7 6] + pmaddwd m4, [r5] + pmaddwd m3, [r5 + 1 * mmsize] + paddd m4, m3 + movd xm1, [r0 + r4] + punpcklwd xm5, xm1 + lea r0, [r0 + 4 * r1] + movd xm3, [r0] + punpcklwd xm1, xm3 + punpcklqdq xm5, xm1 ; m5 = [12 11 11 10] + vinserti128 m2, m2, xm5, 1 ; m2 = [12 11 11 10 10 9 9 8] + movd xm1, [r0 + r1] + punpcklwd xm3, xm1 + movd xm7, [r0 + r1 * 2] + punpcklwd xm1, xm7 + punpcklqdq xm3, xm1 ; m3 = [14 13 13 12] + vinserti128 m5, m5, xm3, 1 ; m5 = [14 13 13 12 12 11 11 10] + pmaddwd m2, [r5] + pmaddwd m5, [r5 + 1 * mmsize] + paddd m2, m5 + movd xm5, [r0 + r4] + punpcklwd xm7, xm5 + lea r0, [r0 + 4 * r1] + movd xm1, [r0] + punpcklwd xm5, xm1 + punpcklqdq xm7, xm5 ; m7 = [16 15 15 14] + vinserti128 m3, m3, xm7, 1 ; m3 = [16 15 15 14 14 13 13 12] + movd xm5, [r0 + r1] + punpcklwd xm1, xm5 + movd xm8, [r0 + r1 * 2] + punpcklwd xm5, xm8 + punpcklqdq xm1, xm5 ; m1 = [18 17 17 16] + vinserti128 m7, m7, xm1, 1 ; m7 = [18 17 17 16 16 15 15 14] + pmaddwd m3, [r5] + pmaddwd m7, [r5 + 1 * mmsize] + paddd m3, m7 +%ifidn %1,sp + paddd m0, m6 + paddd m4, m6 + paddd m2, m6 + paddd m3, m6 + psrad m0, 12 + psrad m4, 12 + psrad m2, 12 + psrad m3, 12 +%else + psrad m0, 6 + psrad m4, 6 + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m0, m4 + packssdw m2, m3 + lea r4, [r3 * 3] +%ifidn %1,sp + packuswb m0, m2 + vextracti128 xm2, m0, 1 + pextrw [r2], xm0, 0 + pextrw [r2 + r3], xm0, 1 + pextrw [r2 + 2 * r3], xm2, 0 + pextrw [r2 + r4], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 2 + pextrw [r2 + r3], xm0, 3 + pextrw [r2 + 2 * r3], xm2, 2 + pextrw [r2 + r4], xm2, 3 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 4 + pextrw [r2 + r3], xm0, 5 + pextrw [r2 + 2 * r3], xm2, 4 + pextrw [r2 + r4], xm2, 5 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm0, 6 + pextrw [r2 + r3], xm0, 7 + pextrw [r2 + 2 * r3], xm2, 6 + pextrw [r2 + r4], xm2, 7 +%else 
+ vextracti128 xm4, m0, 1 + vextracti128 xm3, m2, 1 + movd [r2], xm0 + pextrd [r2 + r3], xm0, 1 + movd [r2 + 2 * r3], xm4 + pextrd [r2 + r4], xm4, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm0, 2 + pextrd [r2 + r3], xm0, 3 + pextrd [r2 + 2 * r3], xm4, 2 + pextrd [r2 + r4], xm4, 3 + lea r2, [r2 + r3 * 4] + movd [r2], xm2 + pextrd [r2 + r3], xm2, 1 + movd [r2 + 2 * r3], xm3 + pextrd [r2 + r4], xm3, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm2, 2 + pextrd [r2 + r3], xm2, 3 + pextrd [r2 + 2 * r3], xm3, 2 + pextrd [r2 + r4], xm3, 3 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_2x16 sp + FILTER_VER_CHROMA_S_AVX2_2x16 ss %macro FILTER_VER_CHROMA_S_AVX2_6x8 1 INIT_YMM avx2 @@ -15985,8 +19666,344 @@ RET %endmacro -FILTER_VER_CHROMA_S_AVX2_6x8 sp -FILTER_VER_CHROMA_S_AVX2_6x8 ss + FILTER_VER_CHROMA_S_AVX2_6x8 sp + FILTER_VER_CHROMA_S_AVX2_6x8 ss + +%macro FILTER_VER_CHROMA_S_AVX2_6x16 1 +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_6x16, 4, 7, 9 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + +%ifdef PIC + lea r5, [pw_ChromaCoeffV] + add r5, r4 +%else + lea r5, [pw_ChromaCoeffV + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,sp + mova m8, [pd_526336] +%else + add r3d, r3d +%endif + lea r6, [r3 * 3] + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m1, m8 + psrad m0, 12 + psrad m1, 12 +%else + psrad m0, 6 + psrad m1, 6 +%endif + packssdw m0, m1 + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm1, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m1 +%ifidn %1,sp + paddd m2, m8 + paddd m3, m8 + psrad m2, 12 + psrad m3, 12 +%else + psrad m2, 6 + psrad m3, 6 +%endif + packssdw m2, m3 +%ifidn %1,sp + packuswb m0, m2 + vextracti128 xm2, m0, 1 + movd [r2], xm0 + pextrw [r2 + 4], xm2, 0 + pextrd [r2 + r3], xm0, 1 + pextrw [r2 + r3 + 4], xm2, 2 + pextrd [r2 + r3 * 2], xm0, 2 + pextrw [r2 + r3 * 2 + 4], xm2, 4 + pextrd [r2 + r6], xm0, 3 + pextrw [r2 + r6 + 4], xm2, 6 +%else + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 + vextracti128 xm0, m0, 1 + vextracti128 xm3, m2, 1 + movd [r2 + 8], xm0 + pextrd [r2 + r3 + 8], xm0, 2 + movd [r2 + r3 * 2 + 8], xm3 + pextrd [r2 + r6 + 8], xm3, 2 +%endif + lea r2, [r2 + r3 * 4] + movu xm1, [r0 + r4] ; m1 = row 7 + punpckhwd xm0, xm6, xm1 + punpcklwd xm6, xm1 + vinserti128 m6, m6, xm0, 1 + pmaddwd m0, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m0 + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 8 + punpckhwd xm2, xm1, xm0 + punpcklwd xm1, xm0 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + pmaddwd m1, [r5] + paddd m5, m2 +%ifidn %1,sp + paddd m4, m8 + 
paddd m5, m8 + psrad m4, 12 + psrad m5, 12 +%else + psrad m4, 6 + psrad m5, 6 +%endif + packssdw m4, m5 + + movu xm2, [r0 + r1] ; m2 = row 9 + punpckhwd xm5, xm0, xm2 + punpcklwd xm0, xm2 + vinserti128 m0, m0, xm5, 1 + pmaddwd m5, m0, [r5 + 1 * mmsize] + paddd m6, m5 + pmaddwd m0, [r5] + movu xm5, [r0 + r1 * 2] ; m5 = row 10 + punpckhwd xm7, xm2, xm5 + punpcklwd xm2, xm5 + vinserti128 m2, m2, xm7, 1 + pmaddwd m7, m2, [r5 + 1 * mmsize] + paddd m1, m7 + pmaddwd m2, [r5] + +%ifidn %1,sp + paddd m6, m8 + paddd m1, m8 + psrad m6, 12 + psrad m1, 12 +%else + psrad m6, 6 + psrad m1, 6 +%endif + packssdw m6, m1 +%ifidn %1,sp + packuswb m4, m6 + vextracti128 xm6, m4, 1 + movd [r2], xm4 + pextrw [r2 + 4], xm6, 0 + pextrd [r2 + r3], xm4, 1 + pextrw [r2 + r3 + 4], xm6, 2 + pextrd [r2 + r3 * 2], xm4, 2 + pextrw [r2 + r3 * 2 + 4], xm6, 4 + pextrd [r2 + r6], xm4, 3 + pextrw [r2 + r6 + 4], xm6, 6 +%else + movq [r2], xm4 + movhps [r2 + r3], xm4 + movq [r2 + r3 * 2], xm6 + movhps [r2 + r6], xm6 + vextracti128 xm4, m4, 1 + vextracti128 xm1, m6, 1 + movd [r2 + 8], xm4 + pextrd [r2 + r3 + 8], xm4, 2 + movd [r2 + r3 * 2 + 8], xm1 + pextrd [r2 + r6 + 8], xm1, 2 +%endif + lea r2, [r2 + r3 * 4] + movu xm7, [r0 + r4] ; m7 = row 11 + punpckhwd xm1, xm5, xm7 + punpcklwd xm5, xm7 + vinserti128 m5, m5, xm1, 1 + pmaddwd m1, m5, [r5 + 1 * mmsize] + paddd m0, m1 + pmaddwd m5, [r5] + lea r0, [r0 + r1 * 4] + movu xm1, [r0] ; m1 = row 12 + punpckhwd xm4, xm7, xm1 + punpcklwd xm7, xm1 + vinserti128 m7, m7, xm4, 1 + pmaddwd m4, m7, [r5 + 1 * mmsize] + paddd m2, m4 + pmaddwd m7, [r5] +%ifidn %1,sp + paddd m0, m8 + paddd m2, m8 + psrad m0, 12 + psrad m2, 12 +%else + psrad m0, 6 + psrad m2, 6 +%endif + packssdw m0, m2 + + movu xm4, [r0 + r1] ; m4 = row 13 + punpckhwd xm2, xm1, xm4 + punpcklwd xm1, xm4 + vinserti128 m1, m1, xm2, 1 + pmaddwd m2, m1, [r5 + 1 * mmsize] + paddd m5, m2 + pmaddwd m1, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 14 + punpckhwd xm6, xm4, xm2 + punpcklwd xm4, xm2 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m7, m6 + pmaddwd m4, [r5] +%ifidn %1,sp + paddd m5, m8 + paddd m7, m8 + psrad m5, 12 + psrad m7, 12 +%else + psrad m5, 6 + psrad m7, 6 +%endif + packssdw m5, m7 +%ifidn %1,sp + packuswb m0, m5 + vextracti128 xm5, m0, 1 + movd [r2], xm0 + pextrw [r2 + 4], xm5, 0 + pextrd [r2 + r3], xm0, 1 + pextrw [r2 + r3 + 4], xm5, 2 + pextrd [r2 + r3 * 2], xm0, 2 + pextrw [r2 + r3 * 2 + 4], xm5, 4 + pextrd [r2 + r6], xm0, 3 + pextrw [r2 + r6 + 4], xm5, 6 +%else + movq [r2], xm0 + movhps [r2 + r3], xm0 + movq [r2 + r3 * 2], xm5 + movhps [r2 + r6], xm5 + vextracti128 xm0, m0, 1 + vextracti128 xm7, m5, 1 + movd [r2 + 8], xm0 + pextrd [r2 + r3 + 8], xm0, 2 + movd [r2 + r3 * 2 + 8], xm7 + pextrd [r2 + r6 + 8], xm7, 2 +%endif + lea r2, [r2 + r3 * 4] + + movu xm6, [r0 + r4] ; m6 = row 15 + punpckhwd xm5, xm2, xm6 + punpcklwd xm2, xm6 + vinserti128 m2, m2, xm5, 1 + pmaddwd m5, m2, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm0, [r0] ; m0 = row 16 + punpckhwd xm5, xm6, xm0 + punpcklwd xm6, xm0 + vinserti128 m6, m6, xm5, 1 + pmaddwd m5, m6, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m6, [r5] +%ifidn %1,sp + paddd m1, m8 + paddd m4, m8 + psrad m1, 12 + psrad m4, 12 +%else + psrad m1, 6 + psrad m4, 6 +%endif + packssdw m1, m4 + + movu xm5, [r0 + r1] ; m5 = row 17 + punpckhwd xm4, xm0, xm5 + punpcklwd xm0, xm5 + vinserti128 m0, m0, xm4, 1 + pmaddwd m0, [r5 + 1 * mmsize] + paddd m2, m0 + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhwd xm0, xm5, xm4 + punpcklwd xm5, 
xm4 + vinserti128 m5, m5, xm0, 1 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m6, m5 +%ifidn %1,sp + paddd m2, m8 + paddd m6, m8 + psrad m2, 12 + psrad m6, 12 +%else + psrad m2, 6 + psrad m6, 6 +%endif + packssdw m2, m6 +%ifidn %1,sp + packuswb m1, m2 + vextracti128 xm2, m1, 1 + movd [r2], xm1 + pextrw [r2 + 4], xm2, 0 + pextrd [r2 + r3], xm1, 1 + pextrw [r2 + r3 + 4], xm2, 2 + pextrd [r2 + r3 * 2], xm1, 2 + pextrw [r2 + r3 * 2 + 4], xm2, 4 + pextrd [r2 + r6], xm1, 3 + pextrw [r2 + r6 + 4], xm2, 6 +%else + movq [r2], xm1 + movhps [r2 + r3], xm1 + movq [r2 + r3 * 2], xm2 + movhps [r2 + r6], xm2 + vextracti128 xm4, m1, 1 + vextracti128 xm6, m2, 1 + movd [r2 + 8], xm4 + pextrd [r2 + r3 + 8], xm4, 2 + movd [r2 + r3 * 2 + 8], xm6 + pextrd [r2 + r6 + 8], xm6, 2 +%endif + RET +%endif +%endmacro + + FILTER_VER_CHROMA_S_AVX2_6x16 sp + FILTER_VER_CHROMA_S_AVX2_6x16 ss ;--------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vertical_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -16031,10 +20048,10 @@ RET %endmacro -FILTER_VER_CHROMA_SS_W2_4R 2, 4 -FILTER_VER_CHROMA_SS_W2_4R 2, 8 + FILTER_VER_CHROMA_SS_W2_4R 2, 4 + FILTER_VER_CHROMA_SS_W2_4R 2, 8 -FILTER_VER_CHROMA_SS_W2_4R 2, 16 + FILTER_VER_CHROMA_SS_W2_4R 2, 16 ;--------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_ss_4x2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -16147,9 +20164,9 @@ RET %endmacro -FILTER_VER_CHROMA_SS_W6_H4 6, 8 + FILTER_VER_CHROMA_SS_W6_H4 6, 8 -FILTER_VER_CHROMA_SS_W6_H4 6, 16 + FILTER_VER_CHROMA_SS_W6_H4 6, 16 ;---------------------------------------------------------------------------------------------------------------- @@ -16194,15 +20211,15 @@ RET %endmacro -FILTER_VER_CHROMA_SS_W8_H2 8, 2 -FILTER_VER_CHROMA_SS_W8_H2 8, 4 -FILTER_VER_CHROMA_SS_W8_H2 8, 6 -FILTER_VER_CHROMA_SS_W8_H2 8, 8 -FILTER_VER_CHROMA_SS_W8_H2 8, 16 -FILTER_VER_CHROMA_SS_W8_H2 8, 32 + FILTER_VER_CHROMA_SS_W8_H2 8, 2 + FILTER_VER_CHROMA_SS_W8_H2 8, 4 + FILTER_VER_CHROMA_SS_W8_H2 8, 6 + FILTER_VER_CHROMA_SS_W8_H2 8, 8 + FILTER_VER_CHROMA_SS_W8_H2 8, 16 + FILTER_VER_CHROMA_SS_W8_H2 8, 32 -FILTER_VER_CHROMA_SS_W8_H2 8, 12 -FILTER_VER_CHROMA_SS_W8_H2 8, 64 + FILTER_VER_CHROMA_SS_W8_H2 8, 12 + FILTER_VER_CHROMA_SS_W8_H2 8, 64 ;----------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) @@ -16442,8 +20459,8 @@ RET %endmacro -FILTER_VER_LUMA_AVX2_4x4 sp -FILTER_VER_LUMA_AVX2_4x4 ss + FILTER_VER_LUMA_AVX2_4x4 sp + FILTER_VER_LUMA_AVX2_4x4 ss %macro FILTER_VER_LUMA_AVX2_4x8 1 INIT_YMM avx2 @@ -16588,8 +20605,8 @@ RET %endmacro -FILTER_VER_LUMA_AVX2_4x8 sp -FILTER_VER_LUMA_AVX2_4x8 ss + FILTER_VER_LUMA_AVX2_4x8 sp + FILTER_VER_LUMA_AVX2_4x8 ss %macro PROCESS_LUMA_AVX2_W4_16R 1 movq xm0, [r0] @@ -16833,8 +20850,8 @@ RET %endmacro -FILTER_VER_LUMA_AVX2_4x16 sp -FILTER_VER_LUMA_AVX2_4x16 ss + FILTER_VER_LUMA_AVX2_4x16 sp + FILTER_VER_LUMA_AVX2_4x16 ss %macro FILTER_VER_LUMA_S_AVX2_8x8 1 INIT_YMM avx2 @@ -17056,8 +21073,8 @@ %endif %endmacro -FILTER_VER_LUMA_S_AVX2_8x8 sp -FILTER_VER_LUMA_S_AVX2_8x8 ss + FILTER_VER_LUMA_S_AVX2_8x8 sp + FILTER_VER_LUMA_S_AVX2_8x8 ss %macro FILTER_VER_LUMA_S_AVX2_8xN 2 INIT_YMM avx2 @@ -17446,10 +21463,10 @@ %endif %endmacro 
-FILTER_VER_LUMA_S_AVX2_8xN sp, 16 -FILTER_VER_LUMA_S_AVX2_8xN sp, 32 -FILTER_VER_LUMA_S_AVX2_8xN ss, 16 -FILTER_VER_LUMA_S_AVX2_8xN ss, 32 + FILTER_VER_LUMA_S_AVX2_8xN sp, 16 + FILTER_VER_LUMA_S_AVX2_8xN sp, 32 + FILTER_VER_LUMA_S_AVX2_8xN ss, 16 + FILTER_VER_LUMA_S_AVX2_8xN ss, 32 %macro PROCESS_LUMA_S_AVX2_W8_4R 1 movu xm0, [r0] ; m0 = row 0 @@ -17592,8 +21609,8 @@ RET %endmacro -FILTER_VER_LUMA_S_AVX2_8x4 sp -FILTER_VER_LUMA_S_AVX2_8x4 ss + FILTER_VER_LUMA_S_AVX2_8x4 sp + FILTER_VER_LUMA_S_AVX2_8x4 ss %macro PROCESS_LUMA_AVX2_W8_16R 1 movu xm0, [r0] ; m0 = row 0 @@ -17988,12 +22005,12 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_Nx16 sp, 16 -FILTER_VER_LUMA_AVX2_Nx16 sp, 32 -FILTER_VER_LUMA_AVX2_Nx16 sp, 64 -FILTER_VER_LUMA_AVX2_Nx16 ss, 16 -FILTER_VER_LUMA_AVX2_Nx16 ss, 32 -FILTER_VER_LUMA_AVX2_Nx16 ss, 64 + FILTER_VER_LUMA_AVX2_Nx16 sp, 16 + FILTER_VER_LUMA_AVX2_Nx16 sp, 32 + FILTER_VER_LUMA_AVX2_Nx16 sp, 64 + FILTER_VER_LUMA_AVX2_Nx16 ss, 16 + FILTER_VER_LUMA_AVX2_Nx16 ss, 32 + FILTER_VER_LUMA_AVX2_Nx16 ss, 64 %macro FILTER_VER_LUMA_AVX2_NxN 3 INIT_YMM avx2 @@ -18047,24 +22064,24 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_NxN 16, 32, sp -FILTER_VER_LUMA_AVX2_NxN 16, 64, sp -FILTER_VER_LUMA_AVX2_NxN 24, 32, sp -FILTER_VER_LUMA_AVX2_NxN 32, 32, sp -FILTER_VER_LUMA_AVX2_NxN 32, 64, sp -FILTER_VER_LUMA_AVX2_NxN 48, 64, sp -FILTER_VER_LUMA_AVX2_NxN 64, 32, sp -FILTER_VER_LUMA_AVX2_NxN 64, 48, sp -FILTER_VER_LUMA_AVX2_NxN 64, 64, sp -FILTER_VER_LUMA_AVX2_NxN 16, 32, ss -FILTER_VER_LUMA_AVX2_NxN 16, 64, ss -FILTER_VER_LUMA_AVX2_NxN 24, 32, ss -FILTER_VER_LUMA_AVX2_NxN 32, 32, ss -FILTER_VER_LUMA_AVX2_NxN 32, 64, ss -FILTER_VER_LUMA_AVX2_NxN 48, 64, ss -FILTER_VER_LUMA_AVX2_NxN 64, 32, ss -FILTER_VER_LUMA_AVX2_NxN 64, 48, ss -FILTER_VER_LUMA_AVX2_NxN 64, 64, ss + FILTER_VER_LUMA_AVX2_NxN 16, 32, sp + FILTER_VER_LUMA_AVX2_NxN 16, 64, sp + FILTER_VER_LUMA_AVX2_NxN 24, 32, sp + FILTER_VER_LUMA_AVX2_NxN 32, 32, sp + FILTER_VER_LUMA_AVX2_NxN 32, 64, sp + FILTER_VER_LUMA_AVX2_NxN 48, 64, sp + FILTER_VER_LUMA_AVX2_NxN 64, 32, sp + FILTER_VER_LUMA_AVX2_NxN 64, 48, sp + FILTER_VER_LUMA_AVX2_NxN 64, 64, sp + FILTER_VER_LUMA_AVX2_NxN 16, 32, ss + FILTER_VER_LUMA_AVX2_NxN 16, 64, ss + FILTER_VER_LUMA_AVX2_NxN 24, 32, ss + FILTER_VER_LUMA_AVX2_NxN 32, 32, ss + FILTER_VER_LUMA_AVX2_NxN 32, 64, ss + FILTER_VER_LUMA_AVX2_NxN 48, 64, ss + FILTER_VER_LUMA_AVX2_NxN 64, 32, ss + FILTER_VER_LUMA_AVX2_NxN 64, 48, ss + FILTER_VER_LUMA_AVX2_NxN 64, 64, ss %macro FILTER_VER_LUMA_S_AVX2_12x16 1 INIT_YMM avx2 @@ -18102,8 +22119,8 @@ %endif %endmacro -FILTER_VER_LUMA_S_AVX2_12x16 sp -FILTER_VER_LUMA_S_AVX2_12x16 ss + FILTER_VER_LUMA_S_AVX2_12x16 sp + FILTER_VER_LUMA_S_AVX2_12x16 ss %macro FILTER_VER_LUMA_S_AVX2_16x12 1 INIT_YMM avx2 @@ -18416,8 +22433,8 @@ %endif %endmacro -FILTER_VER_LUMA_S_AVX2_16x12 sp -FILTER_VER_LUMA_S_AVX2_16x12 ss + FILTER_VER_LUMA_S_AVX2_16x12 sp + FILTER_VER_LUMA_S_AVX2_16x12 ss %macro FILTER_VER_LUMA_S_AVX2_16x4 1 INIT_YMM avx2 @@ -18464,8 +22481,8 @@ RET %endmacro -FILTER_VER_LUMA_S_AVX2_16x4 sp -FILTER_VER_LUMA_S_AVX2_16x4 ss + FILTER_VER_LUMA_S_AVX2_16x4 sp + FILTER_VER_LUMA_S_AVX2_16x4 ss %macro PROCESS_LUMA_S_AVX2_W8_8R 1 movu xm0, [r0] ; m0 = row 0 @@ -18701,10 +22718,10 @@ %endif %endmacro -FILTER_VER_LUMA_AVX2_Nx8 sp, 32 -FILTER_VER_LUMA_AVX2_Nx8 sp, 16 -FILTER_VER_LUMA_AVX2_Nx8 ss, 32 -FILTER_VER_LUMA_AVX2_Nx8 ss, 16 + FILTER_VER_LUMA_AVX2_Nx8 sp, 32 + FILTER_VER_LUMA_AVX2_Nx8 sp, 16 + FILTER_VER_LUMA_AVX2_Nx8 ss, 32 + FILTER_VER_LUMA_AVX2_Nx8 ss, 16 %macro FILTER_VER_LUMA_S_AVX2_32x24 
1 INIT_YMM avx2 @@ -18764,13 +22781,13 @@ %endif %endmacro -FILTER_VER_LUMA_S_AVX2_32x24 sp -FILTER_VER_LUMA_S_AVX2_32x24 ss + FILTER_VER_LUMA_S_AVX2_32x24 sp + FILTER_VER_LUMA_S_AVX2_32x24 ss ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_32x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;-----------------------------------------------------------------------------------------------------------------------------; -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_32x32, 4,7,6 mov r4d, r4m mov r5d, r5m @@ -18832,12 +22849,12 @@ add r0, r1 dec r6d jnz .loop - RET + RET ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_16x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;-----------------------------------------------------------------------------------------------------------------------------; -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_16x16, 4,7,6 mov r4d, r4m mov r5d, r5m @@ -18885,13 +22902,13 @@ add r0, r1 dec r6d jnz .loop - RET + RET ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_16xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;----------------------------------------------------------------------------------------------------------------------------- %macro IPFILTER_CHROMA_PS_16xN_AVX2 2 -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6 mov r4d, r4m mov r5d, r5m @@ -18947,12 +22964,14 @@ IPFILTER_CHROMA_PS_16xN_AVX2 16 , 12 IPFILTER_CHROMA_PS_16xN_AVX2 16 , 8 IPFILTER_CHROMA_PS_16xN_AVX2 16 , 4 + IPFILTER_CHROMA_PS_16xN_AVX2 16 , 24 + IPFILTER_CHROMA_PS_16xN_AVX2 16 , 64 ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_32xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;----------------------------------------------------------------------------------------------------------------------------- %macro IPFILTER_CHROMA_PS_32xN_AVX2 2 -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6 mov r4d, r4m mov r5d, r5m @@ -19019,13 +23038,15 @@ RET %endmacro -IPFILTER_CHROMA_PS_32xN_AVX2 32 , 16 -IPFILTER_CHROMA_PS_32xN_AVX2 32 , 24 -IPFILTER_CHROMA_PS_32xN_AVX2 32 , 8 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 16 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 24 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 8 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 64 + IPFILTER_CHROMA_PS_32xN_AVX2 32 , 48 ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;----------------------------------------------------------------------------------------------------------------------------- -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_4x4, 4,7,5 mov r4d, r4m mov r5d, r5m @@ -19104,7 +23125,7 @@ lea r2, [r2 + r3 * 2] movhps [r2], xm3 .end - RET + RET cglobal interp_4tap_horiz_ps_4x2, 4,7,5 mov r4d, r4m @@ -19173,13 +23194,13 @@ lea r2, [r2 + r3 * 2] movhps [r2], 
xm3 .end - RET + RET ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;-----------------------------------------------------------------------------------------------------------------------------; %macro IPFILTER_CHROMA_PS_4xN_AVX2 2 -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_%1x%2, 4,7,5 mov r4d, r4m mov r5d, r5m @@ -19264,7 +23285,7 @@ lea r2, [r2 + r3 * 2] movhps [r2], xm3 .end -RET + RET %endmacro IPFILTER_CHROMA_PS_4xN_AVX2 4 , 8 @@ -19272,7 +23293,7 @@ ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_8x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;-----------------------------------------------------------------------------------------------------------------------------; -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_8x8, 4,7,6 mov r4d, r4m mov r5d, r5m @@ -19341,9 +23362,9 @@ vpermq m3, m3, 11011000b movu [r2], xm3 .end - RET + RET -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_pp_4x2, 4,6,4 mov r4d, r4m %ifdef PIC @@ -19436,9 +23457,11 @@ RET %endmacro -IPFILTER_CHROMA_PP_32xN_AVX2 32, 16 -IPFILTER_CHROMA_PP_32xN_AVX2 32, 24 -IPFILTER_CHROMA_PP_32xN_AVX2 32, 8 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 16 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 24 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 8 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 64 + IPFILTER_CHROMA_PP_32xN_AVX2 32, 48 ;------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx @@ -19512,15 +23535,17 @@ RET %endmacro -IPFILTER_CHROMA_PP_8xN_AVX2 8 , 16 -IPFILTER_CHROMA_PP_8xN_AVX2 8 , 32 -IPFILTER_CHROMA_PP_8xN_AVX2 8 , 4 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 16 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 32 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 4 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 64 + IPFILTER_CHROMA_PP_8xN_AVX2 8 , 12 ;------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx ;------------------------------------------------------------------------------------------------------------- %macro IPFILTER_CHROMA_PP_4xN_AVX2 2 -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_pp_%1x%2, 4,6,6 mov r4d, r4m @@ -19576,8 +23601,8 @@ RET %endmacro -IPFILTER_CHROMA_PP_4xN_AVX2 4 , 8 -IPFILTER_CHROMA_PP_4xN_AVX2 4 , 16 + IPFILTER_CHROMA_PP_4xN_AVX2 4 , 8 + IPFILTER_CHROMA_PP_4xN_AVX2 4 , 16 %macro IPFILTER_LUMA_PS_32xN_AVX2 2 INIT_YMM avx2 @@ -19674,11 +23699,11 @@ RET %endmacro -IPFILTER_LUMA_PS_32xN_AVX2 32 , 32 -IPFILTER_LUMA_PS_32xN_AVX2 32 , 16 -IPFILTER_LUMA_PS_32xN_AVX2 32 , 24 -IPFILTER_LUMA_PS_32xN_AVX2 32 , 8 -IPFILTER_LUMA_PS_32xN_AVX2 32 , 64 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 32 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 16 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 24 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 8 + IPFILTER_LUMA_PS_32xN_AVX2 32 , 64 INIT_YMM avx2 cglobal interp_8tap_horiz_ps_48x64, 4, 7, 8 @@ -20003,10 +24028,12 @@ RET %endmacro -IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8 -IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32 -IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12 
-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 64 + IPFILTER_CHROMA_PP_16xN_AVX2 16 , 24 %macro IPFILTER_LUMA_PS_64xN_AVX2 1 INIT_YMM avx2 @@ -20144,16 +24171,16 @@ RET %endmacro -IPFILTER_LUMA_PS_64xN_AVX2 64 -IPFILTER_LUMA_PS_64xN_AVX2 48 -IPFILTER_LUMA_PS_64xN_AVX2 32 -IPFILTER_LUMA_PS_64xN_AVX2 16 + IPFILTER_LUMA_PS_64xN_AVX2 64 + IPFILTER_LUMA_PS_64xN_AVX2 48 + IPFILTER_LUMA_PS_64xN_AVX2 32 + IPFILTER_LUMA_PS_64xN_AVX2 16 ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;----------------------------------------------------------------------------------------------------------------------------- %macro IPFILTER_CHROMA_PS_8xN_AVX2 1 -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_8x%1, 4,7,6 mov r4d, r4m mov r5d, r5m @@ -20218,7 +24245,7 @@ vpermq m3, m3, 11011000b movu [r2], xm3 .end - RET + RET %endmacro IPFILTER_CHROMA_PS_8xN_AVX2 2 @@ -20226,6 +24253,8 @@ IPFILTER_CHROMA_PS_8xN_AVX2 16 IPFILTER_CHROMA_PS_8xN_AVX2 6 IPFILTER_CHROMA_PS_8xN_AVX2 4 + IPFILTER_CHROMA_PS_8xN_AVX2 12 + IPFILTER_CHROMA_PS_8xN_AVX2 64 INIT_YMM avx2 cglobal interp_4tap_horiz_ps_2x4, 4, 7, 3 @@ -20253,7 +24282,7 @@ movhps xm2, [r0 + r6] vinserti128 m1, m1, xm2, 1 - pshufb m1, [interp4_hps_shuf] + pshufb m1, [interp4_hpp_shuf] pmaddubsw m1, m0 pmaddwd m1, [pw_1] vextracti128 xm2, m1, 1 @@ -20275,7 +24304,7 @@ movhps xm1, [r0 + r1] movq xm2, [r0 + r1 * 2] vinserti128 m1, m1, xm2, 1 - pshufb m1, [interp4_hps_shuf] + pshufb m1, [interp4_hpp_shuf] pmaddubsw m1, m0 pmaddwd m1, [pw_1] vextracti128 xm2, m1, 1 @@ -20306,7 +24335,7 @@ sub r0, r1 .label - mova m4, [interp4_hps_shuf] + mova m4, [interp4_hpp_shuf] mova m5, [pw_1] dec r0 lea r4, [r1 * 3] @@ -20488,7 +24517,7 @@ ;----------------------------------------------------------------------------------------------------------------------------- ; void interp_4tap_horiz_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;-----------------------------------------------------------------------------------------------------------------------------; -INIT_YMM avx2 +INIT_YMM avx2 cglobal interp_4tap_horiz_ps_6x8, 4,7,6 mov r4d, r4m mov r5d, r5m @@ -20556,3 +24585,1024 @@ movd [r2+8], xm4 .end RET + +INIT_YMM avx2 +cglobal interp_8tap_horiz_ps_12x16, 6, 7, 8 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + add r3d, r3d + vbroadcasti128 m2, [pw_2000] + mov r4d, 16 + vbroadcasti128 m7, [pw_1] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - pw_2000 + + mova m5, [interp8_hps_shuf] + sub r0, 3 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride + sub r0, r6 ; r0(src)-r6 + add r4d, 7 +.loop + + ; Row 0 + + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 + pshufb m3, m1 ; shuffled based on the col order tab_Lm + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 
+ pmaddwd m4, m7 + packssdw m4, m4 + + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vpermd m3, m5, m3 + psubw m3, m2 + + vextracti128 xm4, m3, 1 + movu [r2], xm3 ;row 0 + movq [r2 + 16], xm4 ;row 1 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_8tap_horiz_ps_24x32, 4, 7, 8 + mov r5d, r5m + mov r4d, r4m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + mov r4d, 32 ;height + add r3d, r3d + vbroadcasti128 m2, [pw_2000] + vbroadcasti128 m7, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 , m6 - shuffle order table + ; m2 - pw_2000 + + sub r0, 3 + test r5d, r5d + jz .label + lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride + sub r0, r6 ; r0(src)-r6 + add r4d, 7 ; blkheight += N - 1 (7 - 1 = 6 ; since the last one row not in loop) + +.label + lea r6, [interp8_hps_shuf] +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 ; row 0 (col 4 to 7) + pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m6 ;row 1 (col 4 to 7) + pshufb m4, m1 ;row 1 (col 0 to 3) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m7 + pmaddwd m5, m7 + packssdw m4, m5 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + mova m5, [r6] + vpermd m3, m5, m3 + psubw m3, m2 + movu [r2], m3 ;row 0 + + vbroadcasti128 m3, [r0 + 16] + pshufb m4, m3, m6 + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + pmaddwd m3, m7 + pmaddwd m4, m7 + packssdw m3, m4 + mova m4, [r6] + vpermd m3, m4, m3 + psubw m3, m2 + movu [r2 + 32], xm3 ;row 0 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_24x32, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, 32 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + packssdw m3, m3 + psubw m3, m5 + vpermq m3, m3, 11011000b + movu [r2 + 32], xm3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------- +;macro 
FILTER_H8_W8_16N_AVX2 +;----------------------------------------------------------------------------------------------------------------------- +%macro FILTER_H8_W8_16N_AVX2 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m3, m6 ; row 0 (col 4 to 7) + pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3) + pmaddubsw m3, m0 + pmaddubsw m4, m0 + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m4, m6 ;row 1 (col 4 to 7) + pshufb m4, m1 ;row 1 (col 0 to 3) + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmaddwd m4, m2 + pmaddwd m5, m2 + packssdw m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A] + + pmaddwd m3, m2 + pmaddwd m4, m2 + packssdw m3, m4 ; all rows and col completed. + + mova m5, [interp8_hps_shuf] + vpermd m3, m5, m3 + psubw m3, m8 + + vextracti128 xm4, m3, 1 + mova [r4], xm3 + mova [r4 + 16], xm4 + %endmacro + +;----------------------------------------------------------------------------- +; void interp_8tap_hv_pp_16x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY) +;----------------------------------------------------------------------------- +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_hv_pp_16x16, 4, 10, 15, 0-31*32 +%define stk_buf1 rsp + mov r4d, r4m + mov r5d, r5m +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + + xor r6, r6 + mov r4, rsp + mova m6, [tab_Lm + 32] + mova m1, [tab_Lm] + mov r8, 16 ;height + vbroadcasti128 m8, [pw_2000] + vbroadcasti128 m2, [pw_1] + sub r0, 3 + lea r7, [r1 * 3] ; r7 = (N / 2 - 1) * srcStride + sub r0, r7 ; r0(src)-r7 + add r8, 7 + +.loopH: + FILTER_H8_W8_16N_AVX2 + add r0, r1 + add r4, 32 + inc r6 + cmp r6, 16+7 + jnz .loopH + +; vertical phase + xor r6, r6 + xor r1, r1 +.loopV: + +;load necessary variables + mov r4d, r5d ;coeff here for vertical is r5m + shl r4d, 7 + mov r1d, 16 + add r1d, r1d + + ; load intermedia buffer + mov r0, stk_buf1 + + ; register mapping + ; r0 - src + ; r5 - coeff + ; r6 - loop_i + +; load coeff table +%ifdef PIC + lea r5, [pw_LumaCoeffVer] + add r5, r4 +%else + lea r5, [pw_LumaCoeffVer + r4] +%endif + + lea r4, [r1*3] + mova m14, [pd_526336] + lea r6, [r3 * 3] + mov r9d, 16 / 8 + +.loopW: + PROCESS_LUMA_AVX2_W8_16R sp + add r2, 8 + add r0, 16 + dec r9d + jnz .loopW + RET +%endif + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_12x32, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m6, [pw_512] + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 16 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + vextracti128 xm4, 
m3, 1 + movq [r2], xm3 + pextrd [r2+8], xm3, 2 + movq [r2 + r3], xm4 + pextrd [r2 + r3 + 8],xm4, 2 + lea r2, [r2 + r3 * 2] + lea r0, [r0 + r1 * 2] + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_24x64, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 64 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, 11011000b + + vextracti128 xm4, m3, 1 + movu [r2], xm3 + movq [r2 + 16], xm4 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m4, [interp4_hpp_shuf] + mova m5, [pw_1] + dec r0 + lea r4, [r1 * 3] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + pmulhrsw m1, [pw_512] + vextracti128 xm2, m1, 1 + packuswb xm1, xm2 + + lea r4, [r3 * 3] + pextrw [r2], xm1, 0 + pextrw [r2 + r3], xm1, 1 + pextrw [r2 + r3 * 2], xm1, 4 + pextrw [r2 + r4], xm1, 5 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm1, 2 + pextrw [r2 + r3], xm1, 3 + pextrw [r2 + r3 * 2], xm1, 6 + pextrw [r2 + r4], xm1, 7 + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] + + lea r4, [r1 * 3] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + pmulhrsw m1, [pw_512] + vextracti128 xm2, m1, 1 + packuswb xm1, xm2 + + lea r4, [r3 * 3] + pextrw [r2], xm1, 0 + pextrw [r2 + r3], xm1, 1 + pextrw [r2 + r3 * 2], xm1, 4 + pextrw [r2 + r4], xm1, 5 + lea r2, [r2 + r3 * 4] + pextrw [r2], xm1, 2 + pextrw [r2 + r3], xm1, 3 + pextrw [r2 + r3 * 2], xm1, 6 + pextrw [r2 + r4], xm1, 7 + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PP_64xN_AVX2 1 +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_64x%1, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + 
vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, %1 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + packuswb m3, m4 + vpermq m3, m3, 11011000b + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 36] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 48] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 52] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + packuswb m3, m4 + vpermq m3, m3, 11011000b + movu [r2 + 32], m3 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET +%endmacro + + IPFILTER_CHROMA_PP_64xN_AVX2 64 + IPFILTER_CHROMA_PP_64xN_AVX2 32 + IPFILTER_CHROMA_PP_64xN_AVX2 48 + IPFILTER_CHROMA_PP_64xN_AVX2 16 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_48x64, 4,6,7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [interp4_horiz_shuf1] + vpbroadcastd m2, [pw_1] + mova m6, [pw_512] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 + mov r4d, 64 + +.loop: + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, q3120 + + movu [r2], m3 + + vbroadcasti128 m3, [r0 + mmsize] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + mmsize + 4] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + vbroadcasti128 m4, [r0 + mmsize + 16] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + vbroadcasti128 m5, [r0 + mmsize + 20] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vpermq m3, m3, q3120 + movu [r2 + mmsize], xm3 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_48x64(pixel *src, intptr_t srcStride, 
int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;-----------------------------------------------------------------------------------------------------------------------------; + +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_48x64, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, 64 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2 + 32], m3 + + vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 40] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2 + 64], m3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET + +;----------------------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_24x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;----------------------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_24x64, 4,7,6 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + vbroadcasti128 m2, [pw_1] + vbroadcasti128 m5, [pw_2000] + mova m1, [tab_Tm] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + mov r6d, 64 + dec r0 + test r5d, r5d + je .loop + sub r0 , r1 + add r6d , 3 + +.loop + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + packssdw m3, m3 + psubw m3, m5 + vpermq m3, m3, q3120 + movu [r2 + 32], xm3 + + add r2, r3 + add r0, r1 + dec r6d + jnz .loop + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_ps_2x16, 4, 7, 7 + mov r4d, r4m + mov r5d, r5m + add r3d, r3d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + vbroadcasti128 m6, [pw_2000] + test r5d, r5d + jz .label + sub r0, r1 + +.label + mova m4, [interp4_hps_shuf] + mova m5, [pw_1] + dec r0 + 
lea r4, [r1 * 3] + movq xm1, [r0] ;row 0 + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + psubw m1, m6 + + lea r4, [r3 * 3] + vextracti128 xm2, m1, 1 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + movd [r2 + r3 * 2], xm2 + pextrd [r2 + r4], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm1, 2 + pextrd [r2 + r3], xm1, 3 + pextrd [r2 + r3 * 2], xm2, 2 + pextrd [r2 + r4], xm2, 3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + lea r4, [r1 * 3] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m1, m1, xm2, 1 + lea r0, [r0 + r1 * 4] + movq xm3, [r0] + movhps xm3, [r0 + r1] + movq xm2, [r0 + r1 * 2] + movhps xm2, [r0 + r4] + vinserti128 m3, m3, xm2, 1 + + pshufb m1, m4 + pshufb m3, m4 + pmaddubsw m1, m0 + pmaddubsw m3, m0 + pmaddwd m1, m5 + pmaddwd m3, m5 + packssdw m1, m3 + psubw m1, m6 + + lea r4, [r3 * 3] + vextracti128 xm2, m1, 1 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + movd [r2 + r3 * 2], xm2 + pextrd [r2 + r4], xm2, 1 + lea r2, [r2 + r3 * 4] + pextrd [r2], xm1, 2 + pextrd [r2 + r3], xm1, 3 + pextrd [r2 + r3 * 2], xm2, 2 + pextrd [r2 + r4], xm2, 3 + + test r5d, r5d + jz .end + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + movq xm1, [r0] + movhps xm1, [r0 + r1] + movq xm2, [r0 + r1 * 2] + vinserti128 m1, m1, xm2, 1 + pshufb m1, m4 + pmaddubsw m1, m0 + pmaddwd m1, m5 + packssdw m1, m1 + psubw m1, m6 + vextracti128 xm2, m1, 1 + + movd [r2], xm1 + pextrd [r2 + r3], xm1, 1 + movd [r2 + r3 * 2], xm2 +.end + RET + +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_6x16, 4, 6, 7 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + mova m1, [tab_Tm] + mova m2, [pw_1] + mova m6, [pw_512] + lea r4, [r1 * 3] + lea r5, [r3 * 3] + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + + dec r0 +%rep 4 + ; Row 0 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + + ; Row 1 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + packssdw m3, m4 + pmulhrsw m3, m6 + + ; Row 2 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + pmaddubsw m4, m0 + pmaddwd m4, m2 + + ; Row 3 + vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m5, m1 + pmaddubsw m5, m0 + pmaddwd m5, m2 + packssdw m4, m5 + pmulhrsw m4, m6 + + packuswb m3, m4 + vextracti128 xm4, m3, 1 + movd [r2], xm3 + pextrw [r2 + 4], xm4, 0 + pextrd [r2 + r3], xm3, 1 + pextrw [r2 + r3 + 4], xm4, 2 + pextrd [r2 + r3 * 2], xm3, 2 + pextrw [r2 + r3 * 2 + 4], xm4, 4 + pextrd [r2 + r5], xm3, 3 + pextrw [r2 + r5 + 4], xm4, 6 + lea r2, [r2 + r3 * 4] + lea r0, [r0 + r1 * 4] +%endrep + RET
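The hunk above adds AVX2 4-tap horizontal chroma interpolation kernels (pp and ps variants for the irregular shapes 12x32, 24x64, 2x16, 6x16, 48x64 and 64xN) plus an 8-tap horizontal-plus-vertical 16x16 path. For readers not fluent in the pmaddubsw/pmaddwd idiom, the per-pixel arithmetic being vectorized is small. The sketch below is illustrative only: the standalone signature and helper name are assumptions, not x265's actual C reference, and an 8bpp build is assumed.

#include <cstdint>

// Scalar model of the interp_4tap_horiz_pp kernels: 4 taps spanning
// src[-1..+2], coefficients summing to 64. The pmulhrsw-with-pw_512 step
// in the asm equals (sum + 32) >> 6 for 8-bit data; packuswb is the clamp.
static void interp4_horiz_pp_ref(const uint8_t* src, intptr_t srcStride,
                                 uint8_t* dst, intptr_t dstStride,
                                 const int16_t coeff[4], int width, int height)
{
    src -= 1;                              // taps reach one pixel to the left
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int k = 0; k < 4; k++)
                sum += coeff[k] * src[x + k];
            sum = (sum + 32) >> 6;         // round, drop the 6 filter bits
            dst[x] = (uint8_t)(sum < 0 ? 0 : sum > 255 ? 255 : sum);
        }
        src += srcStride;
        dst += dstStride;
    }
}

The _ps variants in the same hunk keep 16-bit intermediates instead: they omit the shift and clamp and subtract 8192 (the pw_2000 constant), which is why their dst is int16_t* and the destination stride is doubled on entry (add r3d, r3d).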
View file
x265_1.6.tar.gz/source/common/x86/ipfilter8.h -> x265_1.7.tar.gz/source/common/x86/ipfilter8.h
Changed
@@ -289,16 +289,114 @@ SETUP_CHROMA_420_HORIZ_FUNC_DEF(64, 16, cpu); \ SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 64, cpu) -void x265_chroma_p2s_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height); -void x265_luma_p2s_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height); +void x265_filterPixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x4_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x12_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void 
x265_filterPixelToShort_16x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); + +#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \ + void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); + +#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu); + +#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu); + +#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); + +#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu); + +#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 4, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 12, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); + +#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \ + 
SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu); CHROMA_420_VERT_FILTERS(_sse2); CHROMA_420_HORIZ_FILTERS(_sse4); CHROMA_420_VERT_FILTERS_SSE4(_sse4); +CHROMA_420_P2S_FILTERS_SSSE3(_ssse3); +CHROMA_420_P2S_FILTERS_SSE4(_sse4); +CHROMA_420_P2S_FILTERS_AVX2(_avx2); CHROMA_422_VERT_FILTERS(_sse2); CHROMA_422_HORIZ_FILTERS(_sse4); CHROMA_422_VERT_FILTERS_SSE4(_sse4); +CHROMA_422_P2S_FILTERS_SSE4(_sse4); +CHROMA_422_P2S_FILTERS_SSSE3(_ssse3); +CHROMA_422_P2S_FILTERS_AVX2(_avx2); CHROMA_444_VERT_FILTERS(_sse2); CHROMA_444_HORIZ_FILTERS(_sse4); @@ -572,6 +670,48 @@ SETUP_CHROMA_SS_FUNC_DEF(64, 16, cpu); \ SETUP_CHROMA_SS_FUNC_DEF(16, 64, cpu); +#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \ + void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); + +#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu); + +#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu); + +#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu); + +#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); + +#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); + +#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \ + SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \ + SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu); + CHROMA_420_FILTERS(_sse4); CHROMA_420_FILTERS(_avx2); CHROMA_420_SP_FILTERS(_sse2); @@ -582,19 +722,32 @@ CHROMA_420_SS_FILTERS_SSE4(_sse4); CHROMA_420_SS_FILTERS(_avx2); CHROMA_420_SS_FILTERS_SSE4(_avx2); +CHROMA_420_P2S_FILTERS_SSE4(_sse4); +CHROMA_420_P2S_FILTERS_SSSE3(_ssse3); +CHROMA_420_P2S_FILTERS_AVX2(_avx2); CHROMA_422_FILTERS(_sse4); CHROMA_422_FILTERS(_avx2); CHROMA_422_SP_FILTERS(_sse2); +CHROMA_422_SP_FILTERS(_avx2); CHROMA_422_SP_FILTERS_SSE4(_sse4); +CHROMA_422_SP_FILTERS_SSE4(_avx2); CHROMA_422_SS_FILTERS(_sse2); +CHROMA_422_SS_FILTERS(_avx2); CHROMA_422_SS_FILTERS_SSE4(_sse4); +CHROMA_422_SS_FILTERS_SSE4(_avx2); +CHROMA_422_P2S_FILTERS_SSE4(_sse4); +CHROMA_422_P2S_FILTERS_SSSE3(_ssse3); +CHROMA_422_P2S_FILTERS_AVX2(_avx2); +void x265_interp_4tap_vert_ss_2x4_avx2(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_2x4_avx2(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); CHROMA_444_FILTERS(_sse4); CHROMA_444_SP_FILTERS(_sse4); CHROMA_444_SS_FILTERS(_sse2); - -void x265_chroma_p2s_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height); +CHROMA_444_FILTERS(_avx2); +CHROMA_444_SP_FILTERS(_avx2); 
+CHROMA_444_SS_FILTERS(_avx2); #undef SETUP_CHROMA_FUNC_DEF #undef SETUP_CHROMA_SP_FUNC_DEF @@ -623,29 +776,155 @@ LUMA_FILTERS(_avx2); LUMA_SP_FILTERS(_avx2); LUMA_SS_FILTERS(_avx2); -void x265_interp_8tap_hv_pp_8x8_sse4(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); -void x265_pixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); -void x265_pixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst); +void x265_interp_8tap_hv_pp_8x8_ssse3(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); +void x265_interp_8tap_hv_pp_16x16_avx2(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); +void x265_filterPixelToShort_4x4_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_4x8_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_4x16_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void 
x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_interp_4tap_horiz_pp_2x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_2x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_2x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x2_sse3(const pixel *src, intptr_t srcStride, pixel 
*dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_6x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_6x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x2_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x6_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x12_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_12x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_12x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x12_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x24_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_24x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_24x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x24_sse3(const pixel *src, intptr_t srcStride, 
pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x48_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_48x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_64x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_64x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_64x48_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_64x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_4x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_4x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_4x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_8x4_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_8x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_8x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_8x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_12x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_16x4_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_16x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_16x12_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_16x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_16x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_16x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_24x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_32x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_32x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_32x24_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_32x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_32x64_sse2(const pixel* src, 
intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_48x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_64x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_64x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_64x48_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_pp_64x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_horiz_ps_4x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_4x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_4x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_8x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_8x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_8x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_8x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_12x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_16x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_16x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_16x12_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_16x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_16x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_16x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_24x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_32x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_32x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_32x24_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_32x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_32x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int 
coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_48x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_64x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_64x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_64x48_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_horiz_ps_64x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_8tap_hv_pp_8x8_sse3(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); +void x265_interp_4tap_vert_pp_2x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_2x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_2x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_4x2_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_4x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_4x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_4x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_4x32_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +#ifdef X86_64 +void x265_interp_4tap_vert_pp_6x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_6x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x2_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x6_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x12_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x32_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x64_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); +#endif #undef LUMA_FILTERS #undef LUMA_SP_FILTERS #undef LUMA_SS_FILTERS
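The declarations above retire the stride-less x265_chroma_p2s/x265_luma_p2s/x265_pixelToShort helpers that took runtime width and height, replacing them with per-block x265_filterPixelToShort_WxH_cpu entry points that carry an explicit dstStride and bake the block size into the symbol. A plausible way such entry points are consumed is a function-pointer table filled once at init; the table, enum and setup names below are hypothetical (the real wiring lives elsewhere in the encoder) and serve only to illustrate the calling convention.

#include <cstdint>

typedef uint8_t pixel;                    // 8bpp build assumed

extern "C" {                              // signatures as declared in ipfilter8.h above
void x265_filterPixelToShort_16x16_avx2(const pixel*, intptr_t, int16_t*, intptr_t);
void x265_filterPixelToShort_32x32_avx2(const pixel*, intptr_t, int16_t*, intptr_t);
void x265_filterPixelToShort_64x64_avx2(const pixel*, intptr_t, int16_t*, intptr_t);
}

typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride,
                             int16_t* dst, intptr_t dstStride);

enum { BLK_16x16, BLK_32x32, BLK_64x64, NUM_BLK };    // hypothetical indices

static filter_p2s_t convert_p2s[NUM_BLK];

static void setupP2S_avx2()                           // hypothetical init hook
{
    convert_p2s[BLK_16x16] = x265_filterPixelToShort_16x16_avx2;
    convert_p2s[BLK_32x32] = x265_filterPixelToShort_32x32_avx2;
    convert_p2s[BLK_64x64] = x265_filterPixelToShort_64x64_avx2;
}

Fixing the dimensions per symbol lets each kernel unroll completely and choose the widest load its width allows, at the cost of one table slot per partition size.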
View file
x265_1.6.tar.gz/source/common/x86/loopfilter.asm -> x265_1.7.tar.gz/source/common/x86/loopfilter.asm
Changed
@@ -28,31 +28,39 @@ %include "x86inc.asm" SECTION_RODATA 32 -pb_31: times 16 db 31 -pb_15: times 16 db 15 +pb_31: times 32 db 31 +pb_15: times 32 db 15 +pb_movemask_32: times 32 db 0x00 + times 32 db 0xFF SECTION .text cextern pb_1 cextern pb_128 cextern pb_2 cextern pw_2 +cextern pb_movemask ;============================================================================================================ -; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t signLeft) +; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t* signLeft, intptr_t stride) ;============================================================================================================ INIT_XMM sse4 -cglobal saoCuOrgE0, 4, 4, 8, rec, offsetEo, lcuWidth, signLeft +cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride - neg r3 ; r3 = -signLeft - movzx r3d, r3b - movd m0, r3d - mova m4, [pb_128] ; m4 = [80] - pxor m5, m5 ; m5 = 0 - movu m6, [r1] ; m6 = offsetEo + mov r4d, r4m + mova m4, [pb_128] ; m4 = [80] + pxor m5, m5 ; m5 = 0 + movu m6, [r1] ; m6 = offsetEo + + movzx r1d, byte [r3] + inc r3 + neg r1b + movd m0, r1d + lea r1, [r0 + r4] + mov r4d, r2d .loop: - movu m7, [r0] ; m1 = rec[x] + movu m7, [r0] ; m7 = rec[x] movu m2, [r0 + 1] ; m2 = rec[x+1] pxor m1, m7, m4 @@ -69,7 +77,7 @@ pxor m0, m0 palignr m0, m2, 15 paddb m2, m3 - paddb m2, [pb_2] ; m1 = uiEdgeType + paddb m2, [pb_2] ; m2 = uiEdgeType pshufb m3, m6, m2 pmovzxbw m2, m7 ; rec punpckhbw m7, m5 @@ -84,6 +92,97 @@ add r0q, 16 sub r2d, 16 jnz .loop + + movzx r3d, byte [r3] + neg r3b + movd m0, r3d +.loopH: + movu m7, [r1] ; m7 = rec[x] + movu m2, [r1 + 1] ; m2 = rec[x+1] + + pxor m1, m7, m4 + pxor m3, m2, m4 + pcmpgtb m2, m1, m3 + pcmpgtb m3, m1 + pand m2, [pb_1] + por m2, m3 + + pslldq m3, m2, 1 + por m3, m0 + + psignb m3, m4 ; m3 = signLeft + pxor m0, m0 + palignr m0, m2, 15 + paddb m2, m3 + paddb m2, [pb_2] ; m2 = uiEdgeType + pshufb m3, m6, m2 + pmovzxbw m2, m7 ; rec + punpckhbw m7, m5 + pmovsxbw m1, m3 ; offsetEo + punpckhbw m3, m3 + psraw m3, 8 + paddw m2, m1 + paddw m7, m3 + packuswb m2, m7 + movu [r1], m2 + + add r1q, 16 + sub r4d, 16 + jnz .loopH + RET + +INIT_YMM avx2 +cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride + + mov r4d, r4m + vbroadcasti128 m4, [pb_128] ; m4 = [80] + vbroadcasti128 m6, [r1] ; m6 = offsetEo + movzx r1d, byte [r3] + neg r1b + movd xm0, r1d + movzx r1d, byte [r3 + 1] + neg r1b + movd xm1, r1d + vinserti128 m0, m0, xm1, 1 + +.loop: + movu xm5, [r0] ; xm5 = rec[x] + movu xm2, [r0 + 1] ; xm2 = rec[x + 1] + vinserti128 m5, m5, [r0 + r4], 1 + vinserti128 m2, m2, [r0 + r4 + 1], 1 + + pxor m1, m5, m4 + pxor m3, m2, m4 + pcmpgtb m2, m1, m3 + pcmpgtb m3, m1 + pand m2, [pb_1] + por m2, m3 + + pslldq m3, m2, 1 + por m3, m0 + + psignb m3, m4 ; m3 = signLeft + pxor m0, m0 + palignr m0, m2, 15 + paddb m2, m3 + paddb m2, [pb_2] ; m2 = uiEdgeType + pshufb m3, m6, m2 + pmovzxbw m2, xm5 ; rec + vextracti128 xm5, m5, 1 + pmovzxbw m5, xm5 + pmovsxbw m1, xm3 ; offsetEo + vextracti128 xm3, m3, 1 + pmovsxbw m3, xm3 + paddw m2, m1 + paddw m5, m3 + packuswb m2, m5 + vpermq m2, m2, 11011000b + movu [r0], xm2 + vextracti128 [r0 + r4], m2, 1 + + add r0q, 16 + sub r2d, 16 + jnz .loop RET ;================================================================================================== @@ -94,117 +193,382 @@ mov r3d, r3m mov r4d, r4m pxor m0, m0 ; m0 = 0 - movu m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] + mova m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2] mova m7, [pb_128] shr r4d, 4 - .loop - movu m1, [r0] ; m1 = pRec[x] - movu m2, [r0 + r3] ; m2 = pRec[x + iStride] - - pxor m3, m1, m7 - pxor m4, m2, m7 - pcmpgtb m2, m3, m4 - pcmpgtb m4, m3 - pand m2, [pb_1] - por m2, m4 - - movu m3, [r1] ; m3 = m_iUpBuff1 - - paddb m3, m2 - paddb m3, m6 - - movu m4, [r2] ; m4 = m_iOffsetEo - pshufb m5, m4, m3 - - psubb m3, m0, m2 - movu [r1], m3 - - pmovzxbw m2, m1 - punpckhbw m1, m0 - pmovsxbw m3, m5 - punpckhbw m5, m5 - psraw m5, 8 - - paddw m2, m3 - paddw m1, m5 - packuswb m2, m1 - movu [r0], m2 - - add r0, 16 - add r1, 16 - dec r4d - jnz .loop +.loop + movu m1, [r0] ; m1 = pRec[x] + movu m2, [r0 + r3] ; m2 = pRec[x + iStride] + + pxor m3, m1, m7 + pxor m4, m2, m7 + pcmpgtb m2, m3, m4 + pcmpgtb m4, m3 + pand m2, [pb_1] + por m2, m4 + + movu m3, [r1] ; m3 = m_iUpBuff1 + + paddb m3, m2 + paddb m3, m6 + + movu m4, [r2] ; m4 = m_iOffsetEo + pshufb m5, m4, m3 + + psubb m3, m0, m2 + movu [r1], m3 + + pmovzxbw m2, m1 + punpckhbw m1, m0 + pmovsxbw m3, m5 + punpckhbw m5, m5 + psraw m5, 8 + + paddw m2, m3 + paddw m1, m5 + packuswb m2, m1 + movu [r0], m2 + + add r0, 16 + add r1, 16 + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal saoCuOrgE1, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth + mov r3d, r3m + mov r4d, r4m + movu xm0, [r2] ; xm0 = m_iOffsetEo + mova xm6, [pb_2] ; xm6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] + mova xm7, [pb_128] + shr r4d, 4 +.loop + movu xm1, [r0] ; xm1 = pRec[x] + movu xm2, [r0 + r3] ; xm2 = pRec[x + iStride] + + pxor xm3, xm1, xm7 + pxor xm4, xm2, xm7 + pcmpgtb xm2, xm3, xm4 + pcmpgtb xm4, xm3 + pand xm2, [pb_1] + por xm2, xm4 + + movu xm3, [r1] ; xm3 = m_iUpBuff1 + + paddb xm3, xm2 + paddb xm3, xm6 + + pshufb xm5, xm0, xm3 + pxor xm4, xm4 + psubb xm3, xm4, xm2 + movu [r1], xm3 + + pmovzxbw m2, xm1 + pmovsxbw m3, xm5 + + paddw m2, m3 + vextracti128 xm3, m2, 1 + packuswb xm2, xm3 + movu [r0], xm2 + + add r0, 16 + add r1, 16 + dec r4d + jnz .loop + RET + +;======================================================================================================== +; void saoCuOrgE1_2Rows(pixel *pRec, int8_t *m_iUpBuff1, int8_t *m_iOffsetEo, Int iStride, Int iLcuWidth) +;======================================================================================================== +INIT_XMM sse4 +cglobal saoCuOrgE1_2Rows, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth + mov r3d, r3m + mov r4d, r4m + pxor m0, m0 ; m0 = 0 + mova m7, [pb_128] + shr r4d, 4 +.loop + movu m1, [r0] ; m1 = pRec[x] + movu m2, [r0 + r3] ; m2 = pRec[x + iStride] + + pxor m3, m1, m7 + pxor m4, m2, m7 + pcmpgtb m6, m3, m4 + pcmpgtb m5, m4, m3 + pand m6, [pb_1] + por m6, m5 + + movu m5, [r0 + r3 * 2] + pxor m3, m5, m7 + pcmpgtb m5, m4, m3 + pcmpgtb m3, m4 + pand m5, [pb_1] + por m5, m3 + + movu m3, [r1] ; m3 = m_iUpBuff1 + paddb m3, m6 + paddb m3, [pb_2] + + movu m4, [r2] ; m4 = m_iOffsetEo + pshufb m4, m3 + + psubb m3, m0, m6 + movu [r1], m3 + + pmovzxbw m6, m1 + punpckhbw m1, m0 + pmovsxbw m3, m4 + punpckhbw m4, m4 + psraw m4, 8 + + paddw m6, m3 + paddw m1, m4 + packuswb m6, m1 + movu [r0], m6 + + movu m3, [r1] ; m3 = m_iUpBuff1 + paddb m3, m5 + paddb m3, [pb_2] + + movu m4, [r2] ; m4 = m_iOffsetEo + pshufb m4, m3 + psubb m3, m0, m5 + movu [r1], m3 + + pmovzxbw m5, m2 + punpckhbw m2, m0 + pmovsxbw m3, m4 + punpckhbw m4, m4 + psraw m4, 8 + + paddw m5, m3 + paddw m2, m4 + packuswb m5, m2 + movu [r0 + r3], m5 + + add r0, 16 + add r1, 16 + dec r4d + jnz .loop + RET + +INIT_YMM avx2 +cglobal saoCuOrgE1_2Rows, 3, 5, 7, pRec, 
m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth + mov r3d, r3m + mov r4d, r4m + pxor m0, m0 ; m0 = 0 + vbroadcasti128 m5, [pb_128] + vbroadcasti128 m6, [r2] ; m6 = m_iOffsetEo + shr r4d, 4 +.loop + movu xm1, [r0] ; m1 = pRec[x] + movu xm2, [r0 + r3] ; m2 = pRec[x + iStride] + vinserti128 m1, m1, xm2, 1 + vinserti128 m2, m2, [r0 + r3 * 2], 1 + + pxor m3, m1, m5 + pxor m4, m2, m5 + pcmpgtb m2, m3, m4 + pcmpgtb m4, m3 + pand m2, [pb_1] + por m2, m4 + + movu xm3, [r1] ; xm3 = m_iUpBuff + psubb m4, m0, m2 + vinserti128 m3, m3, xm4, 1 + paddb m3, m2 + paddb m3, [pb_2] + pshufb m2, m6, m3 + vextracti128 [r1], m4, 1 + + pmovzxbw m4, xm1 + vextracti128 xm3, m1, 1 + pmovzxbw m3, xm3 + pmovsxbw m1, xm2 + vextracti128 xm2, m2, 1 + pmovsxbw m2, xm2 + + paddw m4, m1 + paddw m3, m2 + packuswb m4, m3 + vpermq m4, m4, 11011000b + movu [r0], xm4 + vextracti128 [r0 + r3], m4, 1 + + add r0, 16 + add r1, 16 + dec r4d + jnz .loop RET ;====================================================================================================================================================== ; void saoCuOrgE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int lcuWidth, intptr_t stride) ;====================================================================================================================================================== INIT_XMM sse4 -cglobal saoCuOrgE2, 5, 7, 8, rec, bufft, buff1, offsetEo, lcuWidth - - mov r6, 16 +cglobal saoCuOrgE2, 5, 6, 8, rec, bufft, buff1, offsetEo, lcuWidth + mov r4d, r4m mov r5d, r5m pxor m0, m0 ; m0 = 0 mova m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] mova m7, [pb_128] - shr r4d, 4 - inc r1q - - .loop - movu m1, [r0] ; m1 = rec[x] - movu m2, [r0 + r5 + 1] ; m2 = rec[x + stride + 1] - pxor m3, m1, m7 - pxor m4, m2, m7 - pcmpgtb m2, m3, m4 - pcmpgtb m4, m3 - pand m2, [pb_1] - por m2, m4 - movu m3, [r2] ; m3 = buff1 - - paddb m3, m2 - paddb m3, m6 ; m3 = edgeType - - movu m4, [r3] ; m4 = offsetEo - pshufb m4, m3 - - psubb m3, m0, m2 - movu [r1], m3 - - pmovzxbw m2, m1 - punpckhbw m1, m0 - pmovsxbw m3, m4 - punpckhbw m4, m4 - psraw m4, 8 - - paddw m2, m3 - paddw m1, m4 - packuswb m2, m1 - movu [r0], m2 - - add r0, r6 - add r1, r6 - add r2, r6 - dec r4d - jnz .loop + inc r1 + movh m5, [r0 + r4] + movhps m5, [r1 + r4] + +.loop + movu m1, [r0] ; m1 = rec[x] + movu m2, [r0 + r5 + 1] ; m2 = rec[x + stride + 1] + pxor m3, m1, m7 + pxor m4, m2, m7 + pcmpgtb m2, m3, m4 + pcmpgtb m4, m3 + pand m2, [pb_1] + por m2, m4 + movu m3, [r2] ; m3 = buff1 + + paddb m3, m2 + paddb m3, m6 ; m3 = edgeType + + movu m4, [r3] ; m4 = offsetEo + pshufb m4, m3 + + psubb m3, m0, m2 + movu [r1], m3 + + pmovzxbw m2, m1 + punpckhbw m1, m0 + pmovsxbw m3, m4 + punpckhbw m4, m4 + psraw m4, 8 + + paddw m2, m3 + paddw m1, m4 + packuswb m2, m1 + movu [r0], m2 + + add r0, 16 + add r1, 16 + add r2, 16 + sub r4, 16 + jg .loop + + movh [r0 + r4], m5 + movhps [r1 + r4], m5 + RET + +INIT_YMM avx2 +cglobal saoCuOrgE2, 5, 6, 7, rec, bufft, buff1, offsetEo, lcuWidth + mov r4d, r4m + mov r5d, r5m + pxor xm0, xm0 ; xm0 = 0 + mova xm5, [pb_128] + inc r1 + movq xm6, [r0 + r4] + movhps xm6, [r1 + r4] + + movu xm1, [r0] ; xm1 = rec[x] + movu xm2, [r0 + r5 + 1] ; xm2 = rec[x + stride + 1] + pxor xm3, xm1, xm5 + pxor xm4, xm2, xm5 + pcmpgtb xm2, xm3, xm4 + pcmpgtb xm4, xm3 + pand xm2, [pb_1] + por xm2, xm4 + movu xm3, [r2] ; xm3 = buff1 + + paddb xm3, xm2 + paddb xm3, [pb_2] ; xm3 = edgeType + + movu xm4, [r3] ; xm4 = offsetEo + pshufb xm4, xm3 + + psubb xm3, xm0, xm2 + movu [r1], xm3 + + pmovzxbw m2, 
xm1 + pmovsxbw m3, xm4 + + paddw m2, m3 + vextracti128 xm3, m2, 1 + packuswb xm2, xm3 + movu [r0], xm2 + + movq [r0 + r4], xm6 + movhps [r1 + r4], xm6 + RET + +INIT_YMM avx2 +cglobal saoCuOrgE2_32, 5, 6, 8, rec, bufft, buff1, offsetEo, lcuWidth + mov r4d, r4m + mov r5d, r5m + pxor m0, m0 ; m0 = 0 + vbroadcasti128 m7, [pb_128] + vbroadcasti128 m5, [r3] ; m5 = offsetEo + inc r1 + movq xm6, [r0 + r4] + movhps xm6, [r1 + r4] + +.loop: + movu m1, [r0] ; m1 = rec[x] + movu m2, [r0 + r5 + 1] ; m2 = rec[x + stride + 1] + pxor m3, m1, m7 + pxor m4, m2, m7 + pcmpgtb m2, m3, m4 + pcmpgtb m4, m3 + pand m2, [pb_1] + por m2, m4 + movu m3, [r2] ; m3 = buff1 + + paddb m3, m2 + paddb m3, [pb_2] ; m3 = edgeType + + pshufb m4, m5, m3 + + psubb m3, m0, m2 + movu [r1], m3 + + pmovzxbw m2, xm1 + vextracti128 xm1, m1, 1 + pmovzxbw m1, xm1 + pmovsxbw m3, xm4 + vextracti128 xm4, m4, 1 + pmovsxbw m4, xm4 + + paddw m2, m3 + paddw m1, m4 + packuswb m2, m1 + vpermq m2, m2, 11011000b + movu [r0], m2 + + add r0, 32 + add r1, 32 + add r2, 32 + sub r4, 32 + jg .loop + + movq [r0 + r4], xm6 + movhps [r1 + r4], xm6 RET ;======================================================================================================= ;void saoCuOrgE3(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX) ;======================================================================================================= INIT_XMM sse4 -cglobal saoCuOrgE3, 3, 7, 8 +cglobal saoCuOrgE3, 3,6,8 mov r3d, r3m mov r4d, r4m mov r5d, r5m - mov r6d, r5d - sub r6d, r4d + ; save latest 2 pixels for case startX=1 or left_endX=15 + movh m7, [r0 + r5] + movhps m7, [r1 + r5 - 1] + ; move to startX+1 inc r4d add r0, r4 add r1, r4 - movh m7, [r0 + r6 - 1] - mov r6, [r1 + r6 - 2] + sub r5d, r4d pxor m0, m0 ; m0 = 0 movu m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] @@ -244,30 +608,143 @@ packuswb m2, m1 movu [r0], m2 - sub r5d, 16 - jle .end + add r0, 16 + add r1, 16 - lea r0, [r0 + 16] - lea r1, [r1 + 16] + sub r5, 16 + jg .loop - jnz .loop + ; restore last pixels (up to 2) + movh [r0 + r5], m7 + movhps [r1 + r5 - 1], m7 + RET -.end: - js .skip - sub r0, r4 - sub r1, r4 - movh [r0 + 16], m7 - mov [r1 + 15], r6 - jmp .quit +INIT_YMM avx2 +cglobal saoCuOrgE3, 3, 6, 8 + mov r3d, r3m + mov r4d, r4m + mov r5d, r5m + + ; save latest 2 pixels for case startX=1 or left_endX=15 + movq xm7, [r0 + r5] + movhps xm7, [r1 + r5 - 1] + + ; move to startX+1 + inc r4d + add r0, r4 + add r1, r4 + sub r5d, r4d + pxor xm0, xm0 ; xm0 = 0 + mova xm6, [pb_2] ; xm6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] + movu xm5, [r2] ; xm5 = m_iOffsetEo + +.loop: + movu xm1, [r0] ; xm1 = pRec[x] + movu xm2, [r0 + r3] ; xm2 = pRec[x + iStride] + + psubusb xm3, xm2, xm1 + psubusb xm4, xm1, xm2 + pcmpeqb xm3, xm0 + pcmpeqb xm4, xm0 + pcmpeqb xm2, xm1 + + pabsb xm3, xm3 + por xm4, xm3 + pandn xm2, xm4 ; xm2 = iSignDown + + movu xm3, [r1] ; xm3 = m_iUpBuff1 + + paddb xm3, xm2 + paddb xm3, xm6 ; xm3 = uiEdgeType + + pshufb xm4, xm5, xm3 + + psubb xm3, xm0, xm2 + movu [r1 - 1], xm3 + + pmovzxbw m2, xm1 + pmovsxbw m3, xm4 + + paddw m2, m3 + vextracti128 xm3, m2, 1 + packuswb xm2, xm3 + movu [r0], xm2 + + add r0, 16 + add r1, 16 + + sub r5, 16 + jg .loop + + ; restore last pixels (up to 2) + movq [r0 + r5], xm7 + movhps [r1 + r5 - 1], xm7 + RET + +INIT_YMM avx2 +cglobal saoCuOrgE3_32, 3, 6, 8 + mov r3d, r3m + mov r4d, r4m + mov r5d, r5m + + ; save latest 2 pixels for case startX=1 or left_endX=15 + movq xm7, [r0 + r5] + movhps xm7, [r1 + r5 - 1] + + 
; move to startX+1 + inc r4d + add r0, r4 + add r1, r4 + sub r5d, r4d + pxor m0, m0 ; m0 = 0 + mova m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] + vbroadcasti128 m5, [r2] ; m5 = m_iOffsetEo + +.loop: + movu m1, [r0] ; m1 = pRec[x] + movu m2, [r0 + r3] ; m2 = pRec[x + iStride] + + psubusb m3, m2, m1 + psubusb m4, m1, m2 + pcmpeqb m3, m0 + pcmpeqb m4, m0 + pcmpeqb m2, m1 + + pabsb m3, m3 + por m4, m3 + pandn m2, m4 ; m2 = iSignDown + + movu m3, [r1] ; m3 = m_iUpBuff1 + + paddb m3, m2 + paddb m3, m6 ; m3 = uiEdgeType + + pshufb m4, m5, m3 + + psubb m3, m0, m2 + movu [r1 - 1], m3 + + pmovzxbw m2, xm1 + vextracti128 xm1, m1, 1 + pmovzxbw m1, xm1 + pmovsxbw m3, xm4 + vextracti128 xm4, m4, 1 + pmovsxbw m4, xm4 -.skip: - sub r0, r4 - sub r1, r4 - movh [r0 + 15], m7 - mov [r1 + 14], r6 + paddw m2, m3 + paddw m1, m4 + packuswb m2, m1 + vpermq m2, m2, 11011000b + movu [r0], m2 -.quit: + add r0, 32 + add r1, 32 + sub r5, 32 + jg .loop + ; restore last pixels (up to 2) + movq [r0 + r5], xm7 + movhps [r1 + r5 - 1], xm7 RET ;===================================================================================== @@ -320,32 +797,181 @@ jnz .loopH RET +INIT_YMM avx2 +cglobal saoCuOrgB0, 4, 7, 8 + + mov r3d, r3m + mov r4d, r4m + mova m7, [pb_31] + vbroadcasti128 m3, [r1 + 0] ; offset[0-15] + vbroadcasti128 m4, [r1 + 16] ; offset[16-31] + lea r6, [r4 * 2] + sub r6d, r2d + shr r2d, 4 + mov r1d, r3d + shr r3d, 1 +.loopH + mov r5d, r2d +.loopW + movu xm2, [r0] ; m2 = [rec] + vinserti128 m2, m2, [r0 + r4], 1 + psrlw m1, m2, 3 + pand m1, m7 ; m1 = [index] + pcmpgtb m0, m1, [pb_15] ; m0 = [mask] + + pshufb m6, m3, m1 + pshufb m5, m4, m1 + + pblendvb m6, m6, m5, m0 ; NOTE: don't use 3 parameters style, x264 macro have some bug! + + pmovzxbw m1, xm2 ; rec + vextracti128 xm2, m2, 1 + pmovzxbw m2, xm2 + pmovsxbw m0, xm6 ; offset + vextracti128 xm6, m6, 1 + pmovsxbw m6, xm6 + + paddw m1, m0 + paddw m2, m6 + packuswb m1, m2 + vpermq m1, m1, 11011000b + + movu [r0], xm1 + vextracti128 [r0 + r4], m1, 1 + add r0, 16 + dec r5d + jnz .loopW + + add r0, r6 + dec r3d + jnz .loopH + test r1b, 1 + jz .end + mov r5d, r2d +.loopW1 + movu xm2, [r0] ; m2 = [rec] + psrlw xm1, xm2, 3 + pand xm1, xm7 ; m1 = [index] + pcmpgtb xm0, xm1, [pb_15] ; m0 = [mask] + + pshufb xm6, xm3, xm1 + pshufb xm5, xm4, xm1 + + pblendvb xm6, xm6, xm5, xm0 ; NOTE: don't use 3 parameters style, x264 macro have some bug! 
+ + pmovzxbw m1, xm2 ; rec + pmovsxbw m0, xm6 ; offset + + paddw m1, m0 + vextracti128 xm0, m1, 1 + packuswb xm1, xm0 + + movu [r0], xm1 + add r0, 16 + dec r5d + jnz .loopW1 +.end + RET + ;============================================================================================================ -; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int endX) +; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int width) ;============================================================================================================ INIT_XMM sse4 -cglobal calSign, 4, 5, 7 +cglobal calSign, 4,5,6 + mova m0, [pb_128] + mova m1, [pb_1] - mov r4, 16 - mova m1, [pb_128] - mova m0, [pb_1] - shr r3d, 4 -.loop - movu m2, [r1] ; m2 = pRec[x] - movu m3, [r2] ; m3 = pTmpU[x] + sub r1, r0 + sub r2, r0 - pxor m4, m2, m1 - pxor m5, m3, m1 - pcmpgtb m6, m4, m5 - pcmpgtb m5, m4 - pand m6, m0 - por m6, m5 + mov r4d, r3d + shr r3d, 4 + jz .next +.loop: + movu m2, [r0 + r1] ; m2 = pRec[x] + movu m3, [r0 + r2] ; m3 = pTmpU[x] + pxor m4, m2, m0 + pxor m3, m0 + pcmpgtb m5, m4, m3 + pcmpgtb m3, m4 + pand m5, m1 + por m5, m3 + movu [r0], m5 + + add r0, 16 + dec r3d + jnz .loop - movu [r0], m6 + ; process partial +.next: + and r4d, 15 + jz .end + + movu m2, [r0 + r1] ; m2 = pRec[x] + movu m3, [r0 + r2] ; m3 = pTmpU[x] + pxor m4, m2, m0 + pxor m3, m0 + pcmpgtb m5, m4, m3 + pcmpgtb m3, m4 + pand m5, m1 + por m5, m3 + + lea r3, [pb_movemask + 16] + sub r3, r4 + movu xmm0, [r3] + movu m3, [r0] + pblendvb m5, m5, m3, xmm0 + movu [r0], m5 - add r0, r4 - add r1, r4 - add r2, r4 - dec r3d - jnz .loop +.end: + RET + +INIT_YMM avx2 +cglobal calSign, 4, 5, 6 + vbroadcasti128 m0, [pb_128] + mova m1, [pb_1] + + sub r1, r0 + sub r2, r0 + + mov r4d, r3d + shr r3d, 5 + jz .next +.loop: + movu m2, [r0 + r1] ; m2 = pRec[x] + movu m3, [r0 + r2] ; m3 = pTmpU[x] + pxor m4, m2, m0 + pxor m3, m0 + pcmpgtb m5, m4, m3 + pcmpgtb m3, m4 + pand m5, m1 + por m5, m3 + movu [r0], m5 + + add r0, mmsize + dec r3d + jnz .loop + + ; process partial +.next: + and r4d, 31 + jz .end + + movu m2, [r0 + r1] ; m2 = pRec[x] + movu m3, [r0 + r2] ; m3 = pTmpU[x] + pxor m4, m2, m0 + pxor m3, m0 + pcmpgtb m5, m4, m3 + pcmpgtb m3, m4 + pand m5, m1 + por m5, m3 + + lea r3, [pb_movemask_32 + 32] + sub r3, r4 + movu m0, [r3] + movu m3, [r0] + pblendvb m5, m5, m3, m0 + movu [r0], m5 + +.end: RET
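The rewritten calSign kernels above handle arbitrary widths with a full-vector main loop plus a movemask-blended tail, and both XOR the inputs with pb_128 so unsigned pixels can be compared with the signed pcmpgtb. For reference, a scalar sketch of the value each kernel computes (the pixel typedef and signOf helper are illustrative assumptions, not part of this patch):

    #include <stdint.h>

    typedef uint8_t pixel; /* assumes an 8bpp build */

    /* branchless -1 / 0 / +1 */
    static inline int8_t signOf(int x)
    {
        return (int8_t)((x >> 31) | ((int)((uint32_t)-x >> 31)));
    }

    /* scalar model of calSign: one sign byte per pixel difference */
    static void calSign_c(int8_t *dst, const pixel *src1, const pixel *src2, int width)
    {
        for (int x = 0; x < width; x++)
            dst[x] = signOf(src1[x] - src2[x]);
    }

In the partial-tail path, pblendvb merges the computed signs with the bytes already at [r0], so writes never extend past the requested width.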
x265_1.6.tar.gz/source/common/x86/loopfilter.h -> x265_1.7.tar.gz/source/common/x86/loopfilter.h
Changed
@@ -25,11 +25,21 @@ #ifndef X265_LOOPFILTER_H #define X265_LOOPFILTER_H -void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t signLeft); +void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride); +void x265_saoCuOrgE0_avx2(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride); void x265_saoCuOrgE1_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width); +void x265_saoCuOrgE1_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width); +void x265_saoCuOrgE1_2Rows_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width); +void x265_saoCuOrgE1_2Rows_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width); void x265_saoCuOrgE2_sse4(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride); +void x265_saoCuOrgE2_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride); +void x265_saoCuOrgE2_32_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride); void x265_saoCuOrgE3_sse4(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX); +void x265_saoCuOrgE3_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX); +void x265_saoCuOrgE3_32_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX); void x265_saoCuOrgB0_sse4(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride); +void x265_saoCuOrgB0_avx2(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride); void x265_calSign_sse4(int8_t *dst, const pixel *src1, const pixel *src2, const int endX); +void x265_calSign_avx2(int8_t *dst, const pixel *src1, const pixel *src2, const int endX); #endif // ifndef X265_LOOPFILTER_H
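These declarations are inert until the function pointers are assigned at runtime. A hypothetical sketch of that wiring, following x265's usual asm-primitives pattern (the EncoderPrimitives members and X265_CPU_* flags here are assumptions, not shown in this hunk):

    #include "loopfilter.h"

    /* hypothetical dispatch: later assignments override earlier ones,
     * so AVX2 kernels replace SSE4 on capable CPUs */
    static void setupLoopFilterPrimitives(EncoderPrimitives *p, uint32_t cpuMask)
    {
        if (cpuMask & X265_CPU_SSE4)
        {
            p->saoCuOrgE1 = x265_saoCuOrgE1_sse4;
            p->saoCuOrgB0 = x265_saoCuOrgB0_sse4;
        }
        if (cpuMask & X265_CPU_AVX2)
        {
            p->saoCuOrgE1 = x265_saoCuOrgE1_avx2;
            p->saoCuOrgB0 = x265_saoCuOrgB0_avx2;
            p->calSign    = x265_calSign_avx2;
        }
    }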
x265_1.6.tar.gz/source/common/x86/mc-a.asm -> x265_1.7.tar.gz/source/common/x86/mc-a.asm
Changed
@@ -1895,8 +1895,10 @@ ADDAVG_W8_H4_AVX2 4 ADDAVG_W8_H4_AVX2 8 +ADDAVG_W8_H4_AVX2 12 ADDAVG_W8_H4_AVX2 16 ADDAVG_W8_H4_AVX2 32 +ADDAVG_W8_H4_AVX2 64 %macro ADDAVG_W12_H4_AVX2 1 INIT_YMM avx2 @@ -1982,6 +1984,7 @@ %endmacro ADDAVG_W12_H4_AVX2 16 +ADDAVG_W12_H4_AVX2 32 %macro ADDAVG_W16_H4_AVX2 1 INIT_YMM avx2 @@ -2044,6 +2047,7 @@ ADDAVG_W16_H4_AVX2 8 ADDAVG_W16_H4_AVX2 12 ADDAVG_W16_H4_AVX2 16 +ADDAVG_W16_H4_AVX2 24 ADDAVG_W16_H4_AVX2 32 ADDAVG_W16_H4_AVX2 64 @@ -2101,6 +2105,7 @@ %endmacro ADDAVG_W24_H2_AVX2 32 +ADDAVG_W24_H2_AVX2 64 %macro ADDAVG_W32_H2_AVX2 1 INIT_YMM avx2 @@ -2157,6 +2162,7 @@ ADDAVG_W32_H2_AVX2 16 ADDAVG_W32_H2_AVX2 24 ADDAVG_W32_H2_AVX2 32 +ADDAVG_W32_H2_AVX2 48 ADDAVG_W32_H2_AVX2 64 %macro ADDAVG_W64_H2_AVX2 1
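The new ADDAVG_WxH_AVX2 instantiations only add block sizes; the added shapes match the chroma partition dimensions used by 4:2:2 encodes (e.g. a 16x12 luma partition maps to an 8x12 chroma block). The per-pixel math is unchanged; a scalar sketch for an 8-bit build (the IF_INTERNAL_* constants follow x265's 14-bit intermediate convention and are restated here as assumptions):

    #include <stdint.h>

    typedef uint8_t pixel;
    #define IF_INTERNAL_PREC 14
    #define IF_INTERNAL_OFFS (1 << (IF_INTERNAL_PREC - 1)) /* 8192 */

    /* scalar model of addAvg: average two 16-bit intermediates into pixels */
    static void addAvg_c(const int16_t *src0, const int16_t *src1, pixel *dst,
                         intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride,
                         int width, int height)
    {
        int shift  = IF_INTERNAL_PREC + 1 - 8;                /* 7 for 8bpp */
        int offset = (1 << (shift - 1)) + 2 * IF_INTERNAL_OFFS;

        for (int y = 0; y < height; y++, src0 += src0Stride, src1 += src1Stride, dst += dstStride)
            for (int x = 0; x < width; x++)
            {
                int v = (src0[x] + src1[x] + offset) >> shift;
                dst[x] = (pixel)(v < 0 ? 0 : v > 255 ? 255 : v);
            }
    }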
x265_1.6.tar.gz/source/common/x86/pixel-a.asm -> x265_1.7.tar.gz/source/common/x86/pixel-a.asm
Changed
@@ -7078,6 +7078,117 @@ .end: RET +; Input 16bpp, Output 8bpp +;------------------------------------------------------------------------------------------------------------------------------------- +;void planecopy_sp(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask) +;------------------------------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal downShift_16, 6,7,3 + movd xm0, r6m ; m0 = shift + add r1d, r1d + dec r5d +.loopH: + xor r6, r6 +.loopW: + movu m1, [r0 + r6 * 2 + 0] + movu m2, [r0 + r6 * 2 + 32] + vpsrlw m1, xm0 + vpsrlw m2, xm0 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r2 + r6], m1 + + add r6d, mmsize + cmp r6d, r4d + jl .loopW + + ; move to next row + add r0, r1 + add r2, r3 + dec r5d + jnz .loopH + +; processing last row of every frame [To handle width which not a multiple of 32] + mov r6d, r4d + and r4d, 31 + shr r6d, 5 + +.loop32: + movu m1, [r0] + movu m2, [r0 + 32] + psrlw m1, xm0 + psrlw m2, xm0 + packuswb m1, m2 + vpermq m1, m1, 11011000b + movu [r2], m1 + + add r0, 2*mmsize + add r2, mmsize + dec r6d + jnz .loop32 + + cmp r4d, 16 + jl .process8 + movu m1, [r0] + psrlw m1, xm0 + packuswb m1, m1 + vpermq m1, m1, 10001000b + movu [r2], xm1 + + add r0, mmsize + add r2, 16 + sub r4d, 16 + jz .end + +.process8: + cmp r4d, 8 + jl .process4 + movu m1, [r0] + psrlw m1, xm0 + packuswb m1, m1 + movq [r2], xm1 + + add r0, 16 + add r2, 8 + sub r4d, 8 + jz .end + +.process4: + cmp r4d, 4 + jl .process2 + movq xm1,[r0] + psrlw m1, xm0 + packuswb m1, m1 + movd [r2], xm1 + + add r0, 8 + add r2, 4 + sub r4d, 4 + jz .end + +.process2: + cmp r4d, 2 + jl .process1 + movd xm1, [r0] + psrlw m1, xm0 + packuswb m1, m1 + movd r6d, xm1 + mov [r2], r6w + + add r0, 4 + add r2, 2 + sub r4d, 2 + jz .end + +.process1: + movd xm1, [r0] + psrlw m1, xm0 + packuswb m1, m1 + movd r3d, xm1 + mov [r2], r3b +.end: + RET + ; Input 8bpp, Output 16bpp ;--------------------------------------------------------------------------------------------------------------------- ;void planecopy_cp(uint8_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift) @@ -10395,3 +10506,1372 @@ mov rsp, r5 RET %endif + +;;--------------------------------------------------------------- +;; SATD AVX2 +;; int pixel_satd(const pixel*, intptr_t, const pixel*, intptr_t) +;;--------------------------------------------------------------- +;; r0 - pix0 +;; r1 - pix0Stride +;; r2 - pix1 +;; r3 - pix1Stride + +%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0 +INIT_YMM avx2 +cglobal calc_satd_16x8 ; function to compute satd cost for 16 columns, 8 rows + pxor m6, m6 + vbroadcasti128 m0, [r0] + vbroadcasti128 m4, [r2] + vbroadcasti128 m1, [r0 + r1] + vbroadcasti128 m5, [r2 + r3] + pmaddubsw m4, m7 + pmaddubsw m0, m7 + pmaddubsw m5, m7 + pmaddubsw m1, m7 + psubw m0, m4 + psubw m1, m5 + vbroadcasti128 m2, [r0 + r1 * 2] + vbroadcasti128 m4, [r2 + r3 * 2] + vbroadcasti128 m3, [r0 + r4] + vbroadcasti128 m5, [r2 + r5] + pmaddubsw m4, m7 + pmaddubsw m2, m7 + pmaddubsw m5, m7 + pmaddubsw m3, m7 + psubw m2, m4 + psubw m3, m5 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + paddw m4, m0, m1 + psubw m1, m1, m0 + paddw m0, m2, m3 + psubw m3, m2 + paddw m2, m4, m0 + psubw m0, m4 + paddw m4, m1, m3 + psubw m3, m1 + pabsw m2, m2 + pabsw m0, m0 + pabsw m4, m4 + pabsw m3, m3 + pblendw m1, m2, m0, 10101010b + pslld m0, 16 + psrld m2, 16 + por m0, m2 + pmaxsw m1, m0 + paddw 
m6, m1 + pblendw m2, m4, m3, 10101010b + pslld m3, 16 + psrld m4, 16 + por m3, m4 + pmaxsw m2, m3 + paddw m6, m2 + vbroadcasti128 m1, [r0] + vbroadcasti128 m4, [r2] + vbroadcasti128 m2, [r0 + r1] + vbroadcasti128 m5, [r2 + r3] + pmaddubsw m4, m7 + pmaddubsw m1, m7 + pmaddubsw m5, m7 + pmaddubsw m2, m7 + psubw m1, m4 + psubw m2, m5 + vbroadcasti128 m0, [r0 + r1 * 2] + vbroadcasti128 m4, [r2 + r3 * 2] + vbroadcasti128 m3, [r0 + r4] + vbroadcasti128 m5, [r2 + r5] + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + pmaddubsw m4, m7 + pmaddubsw m0, m7 + pmaddubsw m5, m7 + pmaddubsw m3, m7 + psubw m0, m4 + psubw m3, m5 + paddw m4, m1, m2 + psubw m2, m1 + paddw m1, m0, m3 + psubw m3, m0 + paddw m0, m4, m1 + psubw m1, m4 + paddw m4, m2, m3 + psubw m3, m2 + pabsw m0, m0 + pabsw m1, m1 + pabsw m4, m4 + pabsw m3, m3 + pblendw m2, m0, m1, 10101010b + pslld m1, 16 + psrld m0, 16 + por m1, m0 + pmaxsw m2, m1 + paddw m6, m2 + pblendw m0, m4, m3, 10101010b + pslld m3, 16 + psrld m4, 16 + por m3, m4 + pmaxsw m0, m3 + paddw m6, m0 + vextracti128 xm0, m6, 1 + pmovzxwd m6, xm6 + pmovzxwd m0, xm0 + paddd m8, m6 + paddd m9, m0 + ret + +cglobal calc_satd_16x4 ; function to compute satd cost for 16 columns, 4 rows + pxor m6, m6 + vbroadcasti128 m0, [r0] + vbroadcasti128 m4, [r2] + vbroadcasti128 m1, [r0 + r1] + vbroadcasti128 m5, [r2 + r3] + pmaddubsw m4, m7 + pmaddubsw m0, m7 + pmaddubsw m5, m7 + pmaddubsw m1, m7 + psubw m0, m4 + psubw m1, m5 + vbroadcasti128 m2, [r0 + r1 * 2] + vbroadcasti128 m4, [r2 + r3 * 2] + vbroadcasti128 m3, [r0 + r4] + vbroadcasti128 m5, [r2 + r5] + pmaddubsw m4, m7 + pmaddubsw m2, m7 + pmaddubsw m5, m7 + pmaddubsw m3, m7 + psubw m2, m4 + psubw m3, m5 + paddw m4, m0, m1 + psubw m1, m1, m0 + paddw m0, m2, m3 + psubw m3, m2 + paddw m2, m4, m0 + psubw m0, m4 + paddw m4, m1, m3 + psubw m3, m1 + pabsw m2, m2 + pabsw m0, m0 + pabsw m4, m4 + pabsw m3, m3 + pblendw m1, m2, m0, 10101010b + pslld m0, 16 + psrld m2, 16 + por m0, m2 + pmaxsw m1, m0 + paddw m6, m1 + pblendw m2, m4, m3, 10101010b + pslld m3, 16 + psrld m4, 16 + por m3, m4 + pmaxsw m2, m3 + paddw m6, m2 + vextracti128 xm0, m6, 1 + pmovzxwd m6, xm6 + pmovzxwd m0, xm0 + paddd m8, m6 + paddd m9, m0 + ret + +cglobal pixel_satd_16x4, 4,6,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + + call calc_satd_16x4 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_16x12, 4,6,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + + call calc_satd_16x8 + call calc_satd_16x4 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_16x32, 4,6,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_16x64, 4,6,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 
+ call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_32x8, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_32x16, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_32x24, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_32x32, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_32x64, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_48x64, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 32] + lea r2, [r7 + 32] 
+ call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_64x16, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 32] + lea r2, [r7 + 32] + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 48] + lea r2, [r7 + 48] + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_64x32, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 32] + lea r2, [r7 + 32] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 48] + lea r2, [r7 + 48] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_64x48, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 32] + lea r2, [r7 + 32] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 48] + lea r2, [r7 + 48] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET + +cglobal pixel_satd_64x64, 4,8,10 ; if WIN64 && cpuflag(avx2) + mova m7, [hmul_16p] + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m8, m8 + pxor m9, m9 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + lea r0, [r6 + 32] + lea r2, [r7 + 32] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call 
calc_satd_16x8 + lea r0, [r6 + 48] + lea r2, [r7 + 48] + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + paddd m8, m9 + vextracti128 xm0, m8, 1 + paddd xm0, xm8 + movhlps xm1, xm0 + paddd xm0, xm1 + pshuflw xm1, xm0, q0032 + paddd xm0, xm1 + movd eax, xm0 + RET +%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0 + +%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1 +INIT_YMM avx2 +cglobal calc_satd_16x8 ; function to compute satd cost for 16 columns, 8 rows + ; rows 0-3 + movu m0, [r0] + movu m4, [r2] + psubw m0, m4 + movu m1, [r0 + r1] + movu m5, [r2 + r3] + psubw m1, m5 + movu m2, [r0 + r1 * 2] + movu m4, [r2 + r3 * 2] + psubw m2, m4 + movu m3, [r0 + r4] + movu m5, [r2 + r5] + psubw m3, m5 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + paddw m4, m0, m1 + psubw m1, m0 + paddw m0, m2, m3 + psubw m3, m2 + punpckhwd m2, m4, m1 + punpcklwd m4, m1 + punpckhwd m1, m0, m3 + punpcklwd m0, m3 + paddw m3, m4, m0 + psubw m0, m4 + paddw m4, m2, m1 + psubw m1, m2 + punpckhdq m2, m3, m0 + punpckldq m3, m0 + paddw m0, m3, m2 + psubw m2, m3 + punpckhdq m3, m4, m1 + punpckldq m4, m1 + paddw m1, m4, m3 + psubw m3, m4 + punpckhqdq m4, m0, m1 + punpcklqdq m0, m1 + pabsw m0, m0 + pabsw m4, m4 + pmaxsw m0, m0, m4 + punpckhqdq m1, m2, m3 + punpcklqdq m2, m3 + pabsw m2, m2 + pabsw m1, m1 + pmaxsw m2, m1 + pxor m7, m7 + mova m1, m0 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m0 + punpckhwd m1, m7 + paddd m6, m1 + pxor m7, m7 + mova m1, m2 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m2 + punpckhwd m1, m7 + paddd m6, m1 + ; rows 4-7 + movu m0, [r0] + movu m4, [r2] + psubw m0, m4 + movu m1, [r0 + r1] + movu m5, [r2 + r3] + psubw m1, m5 + movu m2, [r0 + r1 * 2] + movu m4, [r2 + r3 * 2] + psubw m2, m4 + movu m3, [r0 + r4] + movu m5, [r2 + r5] + psubw m3, m5 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + paddw m4, m0, m1 + psubw m1, m0 + paddw m0, m2, m3 + psubw m3, m2 + punpckhwd m2, m4, m1 + punpcklwd m4, m1 + punpckhwd m1, m0, m3 + punpcklwd m0, m3 + paddw m3, m4, m0 + psubw m0, m4 + paddw m4, m2, m1 + psubw m1, m2 + punpckhdq m2, m3, m0 + punpckldq m3, m0 + paddw m0, m3, m2 + psubw m2, m3 + punpckhdq m3, m4, m1 + punpckldq m4, m1 + paddw m1, m4, m3 + psubw m3, m4 + punpckhqdq m4, m0, m1 + punpcklqdq m0, m1 + pabsw m0, m0 + pabsw m4, m4 + pmaxsw m0, m0, m4 + punpckhqdq m1, m2, m3 + punpcklqdq m2, m3 + pabsw m2, m2 + pabsw m1, m1 + pmaxsw m2, m1 + pxor m7, m7 + mova m1, m0 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m0 + punpckhwd m1, m7 + paddd m6, m1 + pxor m7, m7 + mova m1, m2 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m2 + punpckhwd m1, m7 + paddd m6, m1 + ret + +cglobal calc_satd_16x4 ; function to compute satd cost for 16 columns, 4 rows + ; rows 0-3 + movu m0, [r0] + movu m4, [r2] + psubw m0, m4 + movu m1, [r0 + r1] + movu m5, [r2 + r3] + psubw m1, m5 + movu m2, [r0 + r1 * 2] + movu m4, [r2 + r3 * 2] + psubw m2, m4 + movu m3, [r0 + r4] + movu m5, [r2 + r5] + psubw m3, m5 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + paddw m4, m0, m1 + psubw m1, m0 + paddw m0, m2, m3 + psubw m3, m2 + punpckhwd m2, m4, m1 + punpcklwd m4, m1 + punpckhwd m1, m0, m3 + punpcklwd m0, m3 + paddw m3, m4, m0 + psubw m0, m4 + paddw m4, m2, m1 + psubw m1, m2 + punpckhdq m2, m3, m0 + punpckldq m3, m0 + paddw m0, m3, m2 + psubw m2, m3 + punpckhdq m3, m4, m1 + punpckldq m4, m1 + paddw m1, m4, m3 + psubw m3, m4 + punpckhqdq m4, m0, m1 + punpcklqdq m0, m1 + pabsw m0, m0 + pabsw m4, m4 + pmaxsw m0, m0, m4 + 
punpckhqdq m1, m2, m3 + punpcklqdq m2, m3 + pabsw m2, m2 + pabsw m1, m1 + pmaxsw m2, m1 + pxor m7, m7 + mova m1, m0 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m0 + punpckhwd m1, m7 + paddd m6, m1 + pxor m7, m7 + mova m1, m2 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m2 + punpckhwd m1, m7 + paddd m6, m1 + ret + +cglobal pixel_satd_16x4, 4,6,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + + call calc_satd_16x4 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_16x8, 4,6,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_16x12, 4,6,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + + call calc_satd_16x8 + call calc_satd_16x4 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_16x16, 4,6,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_16x32, 4,6,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_16x64, 4,6,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_32x8, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_32x16, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_32x24, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + 
pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_32x32, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_32x64, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_48x64, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 64] + lea r2, [r7 + 64] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_64x16, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 64] + lea r2, [r7 + 64] + + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 96] + lea r2, [r7 + 96] + + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_64x32, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 64] + lea r2, [r7 + 64] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 96] + lea r2, [r7 + 96] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + 
movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_64x48, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 64] + lea r2, [r7 + 64] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 96] + lea r2, [r7 + 96] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET + +cglobal pixel_satd_64x64, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 32] + lea r2, [r7 + 32] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 64] + lea r2, [r7 + 64] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + lea r0, [r6 + 96] + lea r2, [r7 + 96] + + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + call calc_satd_16x8 + + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 + RET +%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1
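All of the calc_satd_16xN helpers above tile one 4x4 Hadamard transform across the block: the 8-bit path packs two transforms per 128-bit lane via pmaddubsw against hmul_16p, the 16-bit path uses plain word subtracts, and both keep running sums in ymm accumulators so the horizontal reduction happens once per function, at the end. A scalar sketch of the per-4x4 cost, using the usual half-scaled convention (an illustration, not the reference the assembly was checked against):

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint8_t pixel;

    /* scalar 4x4 SATD: 2-D Hadamard of the residual, sum of |coeffs|, halved */
    static int satd_4x4_c(const pixel *pix1, intptr_t stride1,
                          const pixel *pix2, intptr_t stride2)
    {
        int tmp[4][4], sum = 0;

        for (int i = 0; i < 4; i++, pix1 += stride1, pix2 += stride2)
        {
            int a0 = pix1[0] - pix2[0], a1 = pix1[1] - pix2[1];
            int a2 = pix1[2] - pix2[2], a3 = pix1[3] - pix2[3];
            int b0 = a0 + a1, b1 = a0 - a1, b2 = a2 + a3, b3 = a2 - a3;
            tmp[i][0] = b0 + b2; tmp[i][2] = b0 - b2;   /* horizontal pass */
            tmp[i][1] = b1 + b3; tmp[i][3] = b1 - b3;
        }
        for (int i = 0; i < 4; i++)
        {
            int b0 = tmp[0][i] + tmp[1][i], b1 = tmp[0][i] - tmp[1][i];
            int b2 = tmp[2][i] + tmp[3][i], b3 = tmp[2][i] - tmp[3][i];
            sum += abs(b0 + b2) + abs(b0 - b2)          /* vertical pass */
                 + abs(b1 + b3) + abs(b1 - b3);
        }
        return sum >> 1;
    }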
x265_1.6.tar.gz/source/common/x86/pixel-util.h -> x265_1.7.tar.gz/source/common/x86/pixel-util.h
Changed
@@ -73,15 +73,18 @@ float x265_pixel_ssim_end4_sse2(int sum0[5][4], int sum1[5][4], int width); float x265_pixel_ssim_end4_avx(int sum0[5][4], int sum1[5][4], int width); -void x265_scale1D_128to64_ssse3(pixel*, const pixel*, intptr_t); -void x265_scale1D_128to64_avx2(pixel*, const pixel*, intptr_t); +void x265_scale1D_128to64_ssse3(pixel*, const pixel*); +void x265_scale1D_128to64_avx2(pixel*, const pixel*); void x265_scale2D_64to32_ssse3(pixel*, const pixel*, intptr_t); +void x265_scale2D_64to32_avx2(pixel*, const pixel*, intptr_t); -int x265_findPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig); +int x265_scanPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize); +int x265_scanPosLast_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize); +uint32_t x265_findPosFirstLast_ssse3(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]); #define SETUP_CHROMA_PIXELSUB_PS_FUNC(W, H, cpu) \ void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t* dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \ - void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* scr1, intptr_t srcStride0, intptr_t srcStride1); + void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* src1, intptr_t srcStride0, intptr_t srcStride1); #define CHROMA_420_PIXELSUB_DEF(cpu) \ SETUP_CHROMA_PIXELSUB_PS_FUNC(4, 4, cpu); \ @@ -97,7 +100,7 @@ #define SETUP_LUMA_PIXELSUB_PS_FUNC(W, H, cpu) \ void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t* dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \ - void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* scr1, intptr_t srcStride0, intptr_t srcStride1); + void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* src1, intptr_t srcStride0, intptr_t srcStride1); #define LUMA_PIXELSUB_DEF(cpu) \ SETUP_LUMA_PIXELSUB_PS_FUNC(8, 8, cpu); \
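scanPosLast (renamed from findPosLast) and the new findPosFirstLast both take a 4x4 scan table so the kernels can reorder a whole coefficient group with one pshufb. A scalar sketch of findPosFirstLast's packed return value (the group geometry is inferred from the asm comments further down and stated here as an assumption):

    #include <stdint.h>

    /* scalar model: pack the last and first nonzero scan positions of one
     * 4x4 coefficient group as (last << 16) | first */
    static uint32_t findPosFirstLast_c(const int16_t *dstCoeff, intptr_t trSize,
                                       const uint16_t scanTbl[16])
    {
        int first = 0, last = 15;

        while (last > 0 && !dstCoeff[(scanTbl[last] >> 2) * trSize + (scanTbl[last] & 3)])
            last--;
        while (first < last && !dstCoeff[(scanTbl[first] >> 2) * trSize + (scanTbl[first] & 3)])
            first++;
        return ((uint32_t)last << 16) | (uint32_t)first;
    }

The ssse3 kernel derives both positions from a single pmovmskb result, with bsf for the first set bit and bsr for the last.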
x265_1.6.tar.gz/source/common/x86/pixel-util8.asm -> x265_1.7.tar.gz/source/common/x86/pixel-util8.asm
Changed
@@ -40,16 +40,17 @@ ssim_c1: times 4 dd 416 ; .01*.01*255*255*64 ssim_c2: times 4 dd 235963 ; .03*.03*255*255*64*63 %endif -mask_ff: times 16 db 0xff - times 16 db 0 -deinterleave_shuf: db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15 -deinterleave_word_shuf: db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15 -hmul_16p: times 16 db 1 - times 8 db 1, -1 -hmulw_16p: times 8 dw 1 - times 4 dw 1, -1 -trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7 +mask_ff: times 16 db 0xff + times 16 db 0 +deinterleave_shuf: times 2 db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15 +deinterleave_word_shuf: times 2 db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15 +hmul_16p: times 16 db 1 + times 8 db 1, -1 +hmulw_16p: times 8 dw 1 + times 4 dw 1, -1 + +trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7 SECTION .text @@ -67,6 +68,7 @@ cextern pb_2 cextern pb_4 cextern pb_8 +cextern pb_15 cextern pb_16 cextern pb_32 cextern pb_64 @@ -616,7 +618,7 @@ %if ARCH_X86_64 == 1 INIT_YMM avx2 -cglobal quant, 5,5,10 +cglobal quant, 5,6,9 ; fill qbits movd xm4, r4d ; m4 = qbits @@ -627,7 +629,7 @@ ; fill offset vpbroadcastd m5, r5m ; m5 = add - vpbroadcastw m9, [pw_1] ; m9 = word [1] + lea r5, [pw_1] mov r4d, r6m shr r4d, 4 @@ -665,7 +667,7 @@ ; count non-zero coeff ; TODO: popcnt is faster, but some CPU can't support - pminuw m2, m9 + pminuw m2, [r5] paddw m7, m2 add r0, mmsize @@ -1285,9 +1287,8 @@ mov r6d, r6m shl r6d, 16 or r6d, r5d ; assuming both (w0<<6) and round are using maximum of 16 bits each. - movd xm0, r6d - pshufd xm0, xm0, 0 ; m0 = [w0<<6, round] - vinserti128 m0, m0, xm0, 1 ; document says (pshufd + vinserti128) can be replaced with vpbroadcastd m0, xm0, but having build problem, need to investigate + + vpbroadcastd m0, r6d movd xm1, r7m vpbroadcastd m2, r8m @@ -1492,6 +1493,84 @@ dec r5d jnz .loopH RET + +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal weight_sp, 6, 9, 7 + mov r7d, r7m + shl r7d, 16 + or r7d, r6m + vpbroadcastd m0, r7d ; m0 = times 8 dw w0, round + movd xm1, r8m ; m1 = [shift] + vpbroadcastd m2, r9m ; m2 = times 16 dw offset + vpbroadcastw m3, [pw_1] + vpbroadcastw m4, [pw_2000] + + add r2d, r2d ; 2 * srcstride + + mov r7, r0 + mov r8, r1 +.loopH: + mov r6d, r4d ; width + + ; save old src and dst + mov r0, r7 ; src + mov r1, r8 ; dst +.loopW: + movu m5, [r0] + paddw m5, m4 + + punpcklwd m6,m5, m3 + pmaddwd m6, m0 + psrad m6, xm1 + paddd m6, m2 + + punpckhwd m5, m3 + pmaddwd m5, m0 + psrad m5, xm1 + paddd m5, m2 + + packssdw m6, m5 + packuswb m6, m6 + vpermq m6, m6, 10001000b + + sub r6d, 16 + jl .width8 + movu [r1], xm6 + je .nextH + add r0, 32 + add r1, 16 + jmp .loopW + +.width8: + add r6d, 16 + cmp r6d, 8 + jl .width4 + movq [r1], xm6 + je .nextH + psrldq m6, 8 + sub r6d, 8 + add r1, 8 + +.width4: + cmp r6d, 4 + jl .width2 + movd [r1], xm6 + je .nextH + add r1, 4 + pshufd m6, m6, 1 + +.width2: + pextrw [r1], xm6, 0 + +.nextH: + lea r7, [r7 + r2] + lea r8, [r8 + r3] + + dec r5d + jnz .loopH + RET +%endif %endif ; end of (HIGH_BIT_DEPTH == 0) @@ -3944,6 +4023,150 @@ RET %endif +;----------------------------------------------------------------- +; void scale2D_64to32(pixel *dst, pixel *src, intptr_t stride) +;----------------------------------------------------------------- +%if HIGH_BIT_DEPTH +INIT_YMM avx2 +cglobal scale2D_64to32, 3, 4, 5, dest, src, stride + mov r3d, 32 + add r2d, r2d + mova m4, [pw_2000] + +.loop: + movu m0, [r1] + movu m1, [r1 + 1 * mmsize] + movu m2, [r1 + r2] + movu m3, [r1 + r2 + 1 * mmsize] + + paddw m0, m2 + paddw m1, m3 + phaddw m0, m1 + + pmulhrsw m0, m4 + vpermq 
m0, m0, q3120 + movu [r0], m0 + + movu m0, [r1 + 2 * mmsize] + movu m1, [r1 + 3 * mmsize] + movu m2, [r1 + r2 + 2 * mmsize] + movu m3, [r1 + r2 + 3 * mmsize] + + paddw m0, m2 + paddw m1, m3 + phaddw m0, m1 + + pmulhrsw m0, m4 + vpermq m0, m0, q3120 + movu [r0 + mmsize], m0 + + add r0, 64 + lea r1, [r1 + 2 * r2] + dec r3d + jnz .loop + RET +%else + +INIT_YMM avx2 +cglobal scale2D_64to32, 3, 5, 8, dest, src, stride + mov r3d, 16 + mova m7, [deinterleave_shuf] +.loop: + movu m0, [r1] ; i + lea r4, [r1 + r2 * 2] + psrlw m1, m0, 8 ; j + movu m2, [r1 + r2] ; k + psrlw m3, m2, 8 ; l + + pxor m4, m0, m1 ; i^j + pxor m5, m2, m3 ; k^l + por m4, m5 ; ij|kl + + pavgb m0, m1 ; s + pavgb m2, m3 ; t + mova m5, m0 + pavgb m0, m2 ; (s+t+1)/2 + pxor m5, m2 ; s^t + pand m4, m5 ; (ij|kl)&st + pand m4, [pb_1] + psubb m0, m4 ; Result + + movu m1, [r1 + 32] ; i + psrlw m2, m1, 8 ; j + movu m3, [r1 + r2 + 32] ; k + psrlw m4, m3, 8 ; l + + pxor m5, m1, m2 ; i^j + pxor m6, m3, m4 ; k^l + por m5, m6 ; ij|kl + + pavgb m1, m2 ; s + pavgb m3, m4 ; t + mova m6, m1 + pavgb m1, m3 ; (s+t+1)/2 + pxor m6, m3 ; s^t + pand m5, m6 ; (ij|kl)&st + pand m5, [pb_1] + psubb m1, m5 ; Result + + pshufb m0, m0, m7 + pshufb m1, m1, m7 + + punpcklqdq m0, m1 + vpermq m0, m0, 11011000b + movu [r0], m0 + + add r0, 32 + + movu m0, [r4] ; i + psrlw m1, m0, 8 ; j + movu m2, [r4 + r2] ; k + psrlw m3, m2, 8 ; l + + pxor m4, m0, m1 ; i^j + pxor m5, m2, m3 ; k^l + por m4, m5 ; ij|kl + + pavgb m0, m1 ; s + pavgb m2, m3 ; t + mova m5, m0 + pavgb m0, m2 ; (s+t+1)/2 + pxor m5, m2 ; s^t + pand m4, m5 ; (ij|kl)&st + pand m4, [pb_1] + psubb m0, m4 ; Result + + movu m1, [r4 + 32] ; i + psrlw m2, m1, 8 ; j + movu m3, [r4 + r2 + 32] ; k + psrlw m4, m3, 8 ; l + + pxor m5, m1, m2 ; i^j + pxor m6, m3, m4 ; k^l + por m5, m6 ; ij|kl + + pavgb m1, m2 ; s + pavgb m3, m4 ; t + mova m6, m1 + pavgb m1, m3 ; (s+t+1)/2 + pxor m6, m3 ; s^t + pand m5, m6 ; (ij|kl)&st + pand m5, [pb_1] + psubb m1, m5 ; Result + + pshufb m0, m0, m7 + pshufb m1, m1, m7 + + punpcklqdq m0, m1 + vpermq m0, m0, 11011000b + movu [r0], m0 + + lea r1, [r1 + 4 * r2] + add r0, 32 + dec r3d + jnz .loop + RET +%endif ;----------------------------------------------------------------------------- ; void pixel_sub_ps_4x4(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); @@ -4337,18 +4560,70 @@ ;----------------------------------------------------------------------------- ; void pixel_sub_ps_16x16(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); ;----------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%macro PIXELSUB_PS_W16_H4_avx2 1 +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_sub_ps_16x%1, 6, 9, 4, dest, deststride, src0, src1, srcstride0, srcstride1 + add r1d, r1d + add r4d, r4d + add r5d, r5d + lea r6, [r1 * 3] + lea r7, [r4 * 3] + lea r8, [r5 * 3] + +%rep %1/4 + movu m0, [r2] + movu m1, [r3] + movu m2, [r2 + r4] + movu m3, [r3 + r5] + + psubw m0, m1 + psubw m2, m3 + + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + r4 * 2] + movu m1, [r3 + r5 * 2] + movu m2, [r2 + r7] + movu m3, [r3 + r8] + + psubw m0, m1 + psubw m2, m3 + + movu [r0 + r1 * 2], m0 + movu [r0 + r6], m2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] +%endrep + RET +%endif +%endmacro +PIXELSUB_PS_W16_H4_avx2 16 +PIXELSUB_PS_W16_H4_avx2 32 +%else +;----------------------------------------------------------------------------- +; void 
pixel_sub_ps_16x16(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); +;----------------------------------------------------------------------------- +%macro PIXELSUB_PS_W16_H8_avx2 2 +%if ARCH_X86_64 INIT_YMM avx2 -cglobal pixel_sub_ps_16x16, 6, 7, 4, dest, deststride, src0, src1, srcstride0, srcstride1 +cglobal pixel_sub_ps_16x%2, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1 add r1, r1 lea r6, [r1 * 3] + mov r7d, %2/8 -%rep 4 + lea r9, [r4 * 3] + lea r8, [r5 * 3] + +.loop pmovzxbw m0, [r2] pmovzxbw m1, [r3] pmovzxbw m2, [r2 + r4] pmovzxbw m3, [r3 + r5] - lea r2, [r2 + r4 * 2] - lea r3, [r3 + r5 * 2] psubw m0, m1 psubw m2, m3 @@ -4356,6 +4631,21 @@ movu [r0], m0 movu [r0 + r1], m2 + pmovzxbw m0, [r2 + 2 * r4] + pmovzxbw m1, [r3 + 2 * r5] + pmovzxbw m2, [r2 + r9] + pmovzxbw m3, [r3 + r8] + + psubw m0, m1 + psubw m2, m3 + + movu [r0 + r1 * 2], m0 + movu [r0 + r6], m2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + pmovzxbw m0, [r2] pmovzxbw m1, [r3] pmovzxbw m2, [r2 + r4] @@ -4364,14 +4654,34 @@ psubw m0, m1 psubw m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + pmovzxbw m0, [r2 + 2 * r4] + pmovzxbw m1, [r3 + 2 * r5] + pmovzxbw m2, [r2 + r9] + pmovzxbw m3, [r3 + r8] + + psubw m0, m1 + psubw m2, m3 + movu [r0 + r1 * 2], m0 movu [r0 + r6], m2 lea r0, [r0 + r1 * 4] - lea r2, [r2 + r4 * 2] - lea r3, [r3 + r5 * 2] -%endrep + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + + dec r7d + jnz .loop RET +%endif +%endmacro + +PIXELSUB_PS_W16_H8_avx2 16, 16 +PIXELSUB_PS_W16_H8_avx2 16, 32 +%endif + ;----------------------------------------------------------------------------- ; void pixel_sub_ps_32x%2(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); ;----------------------------------------------------------------------------- @@ -4509,10 +4819,83 @@ ;----------------------------------------------------------------------------- ; void pixel_sub_ps_32x32(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); ;----------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%macro PIXELSUB_PS_W32_H4_avx2 1 +%if ARCH_X86_64 INIT_YMM avx2 -cglobal pixel_sub_ps_32x32, 6, 7, 4, dest, deststride, src0, src1, srcstride0, srcstride1 - mov r6d, 4 - add r1, r1 +cglobal pixel_sub_ps_32x%1, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1 + add r1d, r1d + add r4d, r4d + add r5d, r5d + mov r9d, %1/4 + lea r6, [r1 * 3] + lea r7, [r4 * 3] + lea r8, [r5 * 3] + +.loop + movu m0, [r2] + movu m1, [r2 + 32] + movu m2, [r3] + movu m3, [r3 + 32] + psubw m0, m2 + psubw m1, m3 + + movu [r0], m0 + movu [r0 + 32], m1 + + movu m0, [r2 + r4] + movu m1, [r2 + r4 + 32] + movu m2, [r3 + r5] + movu m3, [r3 + r5 + 32] + psubw m0, m2 + psubw m1, m3 + + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + + movu m0, [r2 + r4 * 2] + movu m1, [r2 + r4 * 2 + 32] + movu m2, [r3 + r5 * 2] + movu m3, [r3 + r5 * 2 + 32] + psubw m0, m2 + psubw m1, m3 + + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + + movu m0, [r2 + r7] + movu m1, [r2 + r7 + 32] + movu m2, [r3 + r8] + movu m3, [r3 + r8 + 32] + psubw m0, m2 + psubw m1, m3 + + movu [r0 + r6], m0 + movu [r0 + r6 + 32], m1 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + dec r9d + jnz .loop + RET +%endif +%endmacro +PIXELSUB_PS_W32_H4_avx2 32 +PIXELSUB_PS_W32_H4_avx2 64 +%else 
+;----------------------------------------------------------------------------- +; void pixel_sub_ps_32x32(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); +;----------------------------------------------------------------------------- +%macro PIXELSUB_PS_W32_H8_avx2 2 +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_sub_ps_32x%2, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1 + mov r6d, %2/8 + add r1, r1 + lea r7, [r4 * 3] + lea r8, [r5 * 3] + lea r9, [r1 * 3] .loop: pmovzxbw m0, [r2] @@ -4537,55 +4920,44 @@ movu [r0 + r1], m0 movu [r0 + r1 + 32], m1 - add r2, r4 - add r3, r5 - - pmovzxbw m0, [r2 + r4] - pmovzxbw m1, [r2 + r4 + 16] - pmovzxbw m2, [r3 + r5] - pmovzxbw m3, [r3 + r5 + 16] + pmovzxbw m0, [r2 + 2 * r4] + pmovzxbw m1, [r2 + 2 * r4 + 16] + pmovzxbw m2, [r3 + 2 * r5] + pmovzxbw m3, [r3 + 2 * r5 + 16] psubw m0, m2 psubw m1, m3 - lea r0, [r0 + r1 * 2] - movu [r0 ], m0 - movu [r0 + 32], m1 - - add r2, r4 - add r3, r5 + movu [r0 + r1 * 2 ], m0 + movu [r0 + r1 * 2 + 32], m1 - pmovzxbw m0, [r2 + r4] - pmovzxbw m1, [r2 + r4 + 16] - pmovzxbw m2, [r3 + r5] - pmovzxbw m3, [r3 + r5 + 16] + pmovzxbw m0, [r2 + r7] + pmovzxbw m1, [r2 + r7 + 16] + pmovzxbw m2, [r3 + r8] + pmovzxbw m3, [r3 + r8 + 16] psubw m0, m2 psubw m1, m3 - add r0, r1 - movu [r0 ], m0 - movu [r0 + 32], m1 + movu [r0 + r9], m0 + movu [r0 + r9 +32], m1 - add r2, r4 - add r3, r5 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] - pmovzxbw m0, [r2 + r4] - pmovzxbw m1, [r2 + r4 + 16] - pmovzxbw m2, [r3 + r5] - pmovzxbw m3, [r3 + r5 + 16] + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + 16] + pmovzxbw m2, [r3] + pmovzxbw m3, [r3 + 16] psubw m0, m2 psubw m1, m3 - add r0, r1 movu [r0 ], m0 movu [r0 + 32], m1 - add r2, r4 - add r3, r5 - pmovzxbw m0, [r2 + r4] pmovzxbw m1, [r2 + r4 + 16] pmovzxbw m2, [r3 + r5] @@ -4593,48 +4965,45 @@ psubw m0, m2 psubw m1, m3 - add r0, r1 - movu [r0 ], m0 - movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 - add r2, r4 - add r3, r5 - - pmovzxbw m0, [r2 + r4] - pmovzxbw m1, [r2 + r4 + 16] - pmovzxbw m2, [r3 + r5] - pmovzxbw m3, [r3 + r5 + 16] + pmovzxbw m0, [r2 + 2 * r4] + pmovzxbw m1, [r2 + 2 * r4 + 16] + pmovzxbw m2, [r3 + 2 * r5] + pmovzxbw m3, [r3 + 2 * r5 + 16] psubw m0, m2 psubw m1, m3 - add r0, r1 - movu [r0 ], m0 - movu [r0 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 - add r2, r4 - add r3, r5 - - pmovzxbw m0, [r2 + r4] - pmovzxbw m1, [r2 + r4 + 16] - pmovzxbw m2, [r3 + r5] - pmovzxbw m3, [r3 + r5 + 16] + pmovzxbw m0, [r2 + r7] + pmovzxbw m1, [r2 + r7 + 16] + pmovzxbw m2, [r3 + r8] + pmovzxbw m3, [r3 + r8 + 16] psubw m0, m2 psubw m1, m3 - add r0, r1 - movu [r0 ], m0 - movu [r0 + 32], m1 + movu [r0 + r9], m0 + movu [r0 + r9 + 32], m1 - lea r0, [r0 + r1] - lea r2, [r2 + r4 * 2] - lea r3, [r3 + r5 * 2] + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] dec r6d jnz .loop RET +%endif +%endmacro + +PIXELSUB_PS_W32_H8_avx2 32, 32 +PIXELSUB_PS_W32_H8_avx2 32, 64 +%endif ;----------------------------------------------------------------------------- ; void pixel_sub_ps_64x%2(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); @@ -4858,6 +5227,102 @@ ;----------------------------------------------------------------------------- ; void pixel_sub_ps_64x64(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); 
;----------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_sub_ps_64x64, 6, 10, 8, dest, deststride, src0, src1, srcstride0, srcstride1 + add r1d, r1d + add r4d, r4d + add r5d, r5d + mov r9d, 16 + lea r6, [r1 * 3] + lea r7, [r4 * 3] + lea r8, [r5 * 3] + +.loop + movu m0, [r2] + movu m1, [r2 + 32] + movu m2, [r2 + 64] + movu m3, [r2 + 96] + movu m4, [r3] + movu m5, [r3 + 32] + movu m6, [r3 + 64] + movu m7, [r3 + 96] + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + 64], m2 + movu [r0 + 96], m3 + + movu m0, [r2 + r4] + movu m1, [r2 + r4 + 32] + movu m2, [r2 + r4 + 64] + movu m3, [r2 + r4 + 96] + movu m4, [r3 + r5] + movu m5, [r3 + r5 + 32] + movu m6, [r3 + r5 + 64] + movu m7, [r3 + r5 + 96] + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 + 64], m2 + movu [r0 + r1 + 96], m3 + + movu m0, [r2 + r4 * 2] + movu m1, [r2 + r4 * 2 + 32] + movu m2, [r2 + r4 * 2 + 64] + movu m3, [r2 + r4 * 2 + 96] + movu m4, [r3 + r5 * 2] + movu m5, [r3 + r5 * 2 + 32] + movu m6, [r3 + r5 * 2 + 64] + movu m7, [r3 + r5 * 2 + 96] + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r1 * 2 + 64], m2 + movu [r0 + r1 * 2 + 96], m3 + + movu m0, [r2 + r7] + movu m1, [r2 + r7 + 32] + movu m2, [r2 + r7 + 64] + movu m3, [r2 + r7 + 96] + movu m4, [r3 + r8] + movu m5, [r3 + r8 + 32] + movu m6, [r3 + r8 + 64] + movu m7, [r3 + r8 + 96] + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r0 + r6], m0 + movu [r0 + r6 + 32], m1 + movu [r0 + r6 + 64], m2 + movu [r0 + r6 + 96], m3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + dec r9d + jnz .loop + RET +%endif +%else +;----------------------------------------------------------------------------- +; void pixel_sub_ps_64x64(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); +;----------------------------------------------------------------------------- INIT_YMM avx2 cglobal pixel_sub_ps_64x64, 6, 7, 8, dest, deststride, src0, src1, srcstride0, srcstride1 mov r6d, 16 @@ -4963,7 +5428,7 @@ dec r6d jnz .loop RET - +%endif ;============================================================================= ; variance ;============================================================================= @@ -5387,7 +5852,7 @@ RET %endmacro -;int x265_test_func(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig) +;int scanPosLast(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize) ;{ ; int scanPosLast = 0; ; do @@ -5409,8 +5874,104 @@ ;} %if ARCH_X86_64 == 1 +INIT_XMM avx2,bmi2 +cglobal scanPosLast, 7,11,6 + ; convert unit of Stride(trSize) to int16_t + mov r7d, r7m + add r7d, r7d + + ; loading scan table and convert to Byte + mova m0, [r6] + packuswb m0, [r6 + mmsize] + pxor m1, m0, [pb_15] + + ; clear CG count + xor r9d, r9d + + ; m0 - Zigzag scan table + ; m1 - revert order scan table + ; m4 - zero + ; m5 - ones + + pxor m4, m4 + pcmpeqb m5, m5 + lea r8d, [r7d * 3] + +.loop: + ; position of current CG + movzx r6d, word [r0] + lea r6, [r6 * 2 + r1] + add r0, 16 * 2 + + ; loading current CG + movh m2, [r6] + movhps m2, [r6 + r7] + movh m3, [r6 
+ r7 * 2] + movhps m3, [r6 + r8] + packsswb m2, m3 + + ; Zigzag + pshufb m3, m2, m0 + pshufb m2, m1 + + ; get sign + pmovmskb r6d, m3 + pcmpeqb m3, m4 + pmovmskb r10d, m3 + not r10d + pext r6d, r6d, r10d + mov [r2 + r9 * 2], r6w + + ; get non-zero flag + ; TODO: reuse above result with reorder + pcmpeqb m2, m4 + pxor m2, m5 + pmovmskb r6d, m2 + mov [r3 + r9 * 2], r6w + + ; get non-zero number, POPCNT is faster + pabsb m2, m2 + psadbw m2, m4 + movhlps m3, m2 + paddd m2, m3 + movd r6d, m2 + mov [r4 + r9], r6b + + inc r9d + sub r5d, r6d + jg .loop + + ; fixup last CG non-zero flag + dec r9d + movzx r0d, word [r3 + r9 * 2] +;%if cpuflag(bmi1) ; 2uops? +; tzcnt r1d, r0d +;%else + bsf r1d, r0d +;%endif + shrx r0d, r0d, r1d + mov [r3 + r9 * 2], r0w + + ; get last pos + mov eax, r9d + shl eax, 4 + xor r1d, 15 + add eax, r1d + RET + + +; t3 must be ecx, since it's used for shift. +%if WIN64 + DECLARE_REG_TMP 3,1,2,0 +%elif ARCH_X86_64 + DECLARE_REG_TMP 0,1,2,3 +%else ; X86_32 + %error Unsupport platform X86_32 +%endif INIT_CPUFLAGS -cglobal findPosLast_x64, 5,12 +cglobal scanPosLast_x64, 5,12 + mov r10, r3mp + movifnidn t0, r0mp mov r5d, r5m xor r11d, r11d ; cgIdx xor r7d, r7d ; tmp for non-zero flag @@ -5418,40 +5979,78 @@ .loop: xor r8d, r8d ; coeffSign[] xor r9d, r9d ; coeffFlag[] - xor r10d, r10d ; coeffNum[] + xor t3d, t3d ; coeffNum[] %assign x 0 %rep 16 - movzx r6d, word [r0 + x * 2] - movsx r6d, word [r1 + r6 * 2] + movzx r6d, word [t0 + x * 2] + movsx r6d, word [t1 + r6 * 2] test r6d, r6d setnz r7b shr r6d, 31 - shlx r6d, r6d, r10d + shl r6d, t3b or r8d, r6d lea r9, [r9 * 2 + r7] - add r10d, r7d + add t3d, r7d %assign x x+1 %endrep ; store latest group data - mov [r2 + r11 * 2], r8w - mov [r3 + r11 * 2], r9w - mov [r4 + r11], r10b + mov [t2 + r11 * 2], r8w + mov [r10 + r11 * 2], r9w + mov [r4 + r11], t3b inc r11d - add r0, 16 * 2 - sub r5d, r10d + add t0, 16 * 2 + sub r5d, t3d jnz .loop ; store group data - tzcnt r6d, r9d - shrx r9d, r9d, r6d - mov [r3 + (r11 - 1) * 2], r9w + bsf t3d, r9d + shr r9d, t3b + mov [r10 + (r11 - 1) * 2], r9w ; get posLast shl r11d, 4 - sub r11d, r6d + sub r11d, t3d lea eax, [r11d - 1] RET %endif + + +;----------------------------------------------------------------------------- +; uint32_t[last first] findPosFirstAndLast(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]) +;----------------------------------------------------------------------------- +INIT_XMM ssse3 +cglobal findPosFirstLast, 3,3,3 + ; convert stride to int16_t + add r1d, r1d + + ; loading scan table and convert to Byte + mova m0, [r2] + packuswb m0, [r2 + mmsize] + + ; loading 16 of coeff + movh m1, [r0] + movhps m1, [r0 + r1] + movh m2, [r0 + r1 * 2] + lea r1, [r1 * 3] + movhps m2, [r0 + r1] + packsswb m1, m2 + + ; get non-zero mask + pxor m2, m2 + pcmpeqb m1, m2 + + ; reorder by Zigzag scan + pshufb m1, m0 + + ; get First and Last pos + xor eax, eax + pmovmskb r0d, m1 + not r0w + bsr r1w, r0w + bsf ax, r0w + shl r1d, 16 + or eax, r1d + RET
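Of the kernels above, the new AVX2 weight_sp is the densest. It packs (round << 16) | w0 into every dword of m0 and interleaves each source word with the constant 1, so one pmaddwd computes src*w0 + round in a single step; the pw_2000 it adds first is x265's 14-bit internal offset (0x2000). A scalar sketch of the same computation, assuming 8-bit output for the clip bounds:

    #include <stdint.h>

    typedef uint8_t pixel;

    /* scalar model of weight_sp: bias, weight, round-shift, offset, clip */
    static void weight_sp_c(const int16_t *src, pixel *dst,
                            intptr_t srcStride, intptr_t dstStride,
                            int width, int height,
                            int w0, int round, int shift, int offset)
    {
        for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
            for (int x = 0; x < width; x++)
            {
                int v = ((w0 * (src[x] + 0x2000) + round) >> shift) + offset;
                dst[x] = (pixel)(v < 0 ? 0 : v > 255 ? 255 : v);
            }
    }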
x265_1.6.tar.gz/source/common/x86/pixel.h -> x265_1.7.tar.gz/source/common/x86/pixel.h
Changed
@@ -226,6 +226,7 @@ ADDAVG(addAvg_32x48) void x265_downShift_16_sse2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); +void x265_downShift_16_avx2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); void x265_upShift_8_sse4(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); int x265_psyCost_pp_4x4_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); int x265_psyCost_pp_8x8_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); @@ -256,10 +257,14 @@ void x265_pixel_add_ps_16x16_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); void x265_pixel_add_ps_32x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); void x265_pixel_add_ps_64x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_16x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_32x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); void x265_pixel_sub_ps_16x16_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); void x265_pixel_sub_ps_32x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); void x265_pixel_sub_ps_64x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_16x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_32x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); int x265_psyCost_pp_4x4_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); int x265_psyCost_pp_8x8_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); @@ -272,6 +277,7 @@ int x265_psyCost_ss_16x16_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); int x265_psyCost_ss_32x32_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); int x265_psyCost_ss_64x64_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); +void x265_weight_sp_avx2(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset); #undef DECL_PIXELS #undef DECL_HEVC_SSD
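The pixel_add_ps/pixel_sub_ps prototypes added here all share one shape; taking sub_ps as the example, the contract is: widen source minus prediction into an int16_t residual block, with independent strides. A scalar reference sketch (the pixel typedef follows the build's bit depth; illustrative only, not the x265 C primitive itself):

#include <cstdint>

typedef uint8_t pixel;   // uint16_t in HIGH_BIT_DEPTH builds

// Scalar model of the pixel_sub_ps_WxH primitives declared above:
// residual = source - prediction, widened to int16_t.
template<int W, int H>
void pixel_sub_ps_c(int16_t* a, intptr_t dstride,
                    const pixel* b0, const pixel* b1,
                    intptr_t sstride0, intptr_t sstride1)
{
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
            a[x] = (int16_t)(b0[x] - b1[x]);
        a  += dstride;
        b0 += sstride0;
        b1 += sstride1;
    }
}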
x265_1.6.tar.gz/source/common/x86/pixeladd8.asm -> x265_1.7.tar.gz/source/common/x86/pixeladd8.asm
Changed
@@ -398,10 +398,65 @@ jnz .loop RET +%endif +%endmacro +PIXEL_ADD_PS_W16_H4 16, 16 +PIXEL_ADD_PS_W16_H4 16, 32 +;----------------------------------------------------------------------------- +; void pixel_add_ps_16x16(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1) +;----------------------------------------------------------------------------- +%macro PIXEL_ADD_PS_W16_H4_avx2 1 +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 INIT_YMM avx2 -cglobal pixel_add_ps_16x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1 - mov r6d, %2/4 +cglobal pixel_add_ps_16x%1, 6, 10, 4, dest, destride, src0, scr1, srcStride0, srcStride1 + mova m3, [pw_pixel_max] + pxor m2, m2 + mov r6d, %1/4 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r7, [r4 * 3] + lea r8, [r5 * 3] + lea r9, [r1 * 3] + +.loop: + movu m0, [r2] + movu m1, [r3] + paddw m0, m1 + CLIPW m0, m2, m3 + movu [r0], m0 + + movu m0, [r2 + r4] + movu m1, [r3 + r5] + paddw m0, m1 + CLIPW m0, m2, m3 + movu [r0 + r1], m0 + + movu m0, [r2 + r4 * 2] + movu m1, [r3 + r5 * 2] + paddw m0, m1 + CLIPW m0, m2, m3 + movu [r0 + r1 * 2], m0 + + movu m0, [r2 + r7] + movu m1, [r3 + r8] + paddw m0, m1 + CLIPW m0, m2, m3 + movu [r0 + r9], m0 + + dec r6d + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + jnz .loop + RET +%endif +%else +INIT_YMM avx2 +cglobal pixel_add_ps_16x%1, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1 + mov r6d, %1/4 add r5, r5 .loop: @@ -447,8 +502,8 @@ %endif %endmacro -PIXEL_ADD_PS_W16_H4 16, 16 -PIXEL_ADD_PS_W16_H4 16, 32 +PIXEL_ADD_PS_W16_H4_avx2 16 +PIXEL_ADD_PS_W16_H4_avx2 32 ;----------------------------------------------------------------------------- @@ -569,11 +624,90 @@ jnz .loop RET +%endif +%endmacro +PIXEL_ADD_PS_W32_H2 32, 32 +PIXEL_ADD_PS_W32_H2 32, 64 +;----------------------------------------------------------------------------- +; void pixel_add_ps_32x32(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1) +;----------------------------------------------------------------------------- +%macro PIXEL_ADD_PS_W32_H4_avx2 1 +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 INIT_YMM avx2 -cglobal pixel_add_ps_32x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1 - mov r6d, %2/4 +cglobal pixel_add_ps_32x%1, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1 + mova m5, [pw_pixel_max] + pxor m4, m4 + mov r6d, %1/4 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r7, [r4 * 3] + lea r8, [r5 * 3] + lea r9, [r1 * 3] + +.loop: + movu m0, [r2] + movu m2, [r2 + 32] + movu m1, [r3] + movu m3, [r3 + 32] + paddw m0, m1 + paddw m2, m3 + CLIPW2 m0, m2, m4, m5 + + movu [r0], m0 + movu [r0 + 32], m2 + + movu m0, [r2 + r4] + movu m2, [r2 + r4 + 32] + movu m1, [r3 + r5] + movu m3, [r3 + r5 + 32] + paddw m0, m1 + paddw m2, m3 + CLIPW2 m0, m2, m4, m5 + + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m2 + + movu m0, [r2 + r4 * 2] + movu m2, [r2 + r4 * 2 + 32] + movu m1, [r3 + r5 * 2] + movu m3, [r3 + r5 * 2 + 32] + paddw m0, m1 + paddw m2, m3 + CLIPW2 m0, m2, m4, m5 + + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m2 + + movu m0, [r2 + r7] + movu m2, [r2 + r7 + 32] + movu m1, [r3 + r8] + movu m3, [r3 + r8 + 32] + paddw m0, m1 + paddw m2, m3 + CLIPW2 m0, m2, m4, m5 + + movu [r0 + r9], m0 + movu [r0 + r9 + 32], m2 + + dec r6d + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + jnz .loop + RET +%endif +%else +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_add_ps_32x%1, 6, 10, 8, 
dest, destride, src0, scr1, srcStride0, srcStride1 + mov r6d, %1/4 add r5, r5 + lea r7, [r4 * 3] + lea r8, [r5 * 3] + lea r9, [r1 * 3] .loop: pmovzxbw m0, [r2] ; first half of row 0 of src0 pmovzxbw m1, [r2 + 16] ; second half of row 0 of src0 @@ -597,44 +731,41 @@ vpermq m0, m0, 11011000b movu [r0 + r1], m0 ; row 1 of dst - lea r2, [r2 + r4 * 2] - lea r3, [r3 + r5 * 2] - lea r0, [r0 + r1 * 2] - - pmovzxbw m0, [r2] ; first half of row 2 of src0 - pmovzxbw m1, [r2 + 16] ; second half of row 2 of src0 - movu m2, [r3] ; first half of row 2 of src1 - movu m3, [r3 + 32] ; second half of row 2 of src1 + pmovzxbw m0, [r2 + r4 * 2] ; first half of row 2 of src0 + pmovzxbw m1, [r2 + r4 * 2 + 16] ; second half of row 2 of src0 + movu m2, [r3 + r5 * 2] ; first half of row 2 of src1 + movu m3, [r3 + + r5 * 2 + 32]; second half of row 2 of src1 paddw m0, m2 paddw m1, m3 packuswb m0, m1 vpermq m0, m0, 11011000b - movu [r0], m0 ; row 2 of dst + movu [r0 + r1 * 2], m0 ; row 2 of dst - pmovzxbw m0, [r2 + r4] ; first half of row 3 of src0 - pmovzxbw m1, [r2 + r4 + 16] ; second half of row 3 of src0 - movu m2, [r3 + r5] ; first half of row 3 of src1 - movu m3, [r3 + r5 + 32] ; second half of row 3 of src1 + pmovzxbw m0, [r2 + r7] ; first half of row 3 of src0 + pmovzxbw m1, [r2 + r7 + 16] ; second half of row 3 of src0 + movu m2, [r3 + r8] ; first half of row 3 of src1 + movu m3, [r3 + r8 + 32] ; second half of row 3 of src1 paddw m0, m2 paddw m1, m3 packuswb m0, m1 vpermq m0, m0, 11011000b - movu [r0 + r1], m0 ; row 3 of dst + movu [r0 + r9], m0 ; row 3 of dst - lea r2, [r2 + r4 * 2] - lea r3, [r3 + r5 * 2] - lea r0, [r0 + r1 * 2] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] dec r6d jnz .loop RET %endif +%endif %endmacro -PIXEL_ADD_PS_W32_H2 32, 32 -PIXEL_ADD_PS_W32_H2 32, 64 +PIXEL_ADD_PS_W32_H4_avx2 32 +PIXEL_ADD_PS_W32_H4_avx2 64 ;----------------------------------------------------------------------------- @@ -841,10 +972,127 @@ jnz .loop RET +%endif +%endmacro +PIXEL_ADD_PS_W64_H2 64, 64 +;----------------------------------------------------------------------------- +; void pixel_add_ps_64x64(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1) +;----------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 INIT_YMM avx2 -cglobal pixel_add_ps_64x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1 - mov r6d, %2/2 +cglobal pixel_add_ps_64x64, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1 + mova m5, [pw_pixel_max] + pxor m4, m4 + mov r6d, 16 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r7, [r4 * 3] + lea r8, [r5 * 3] + lea r9, [r1 * 3] + +.loop: + movu m0, [r2] + movu m1, [r2 + 32] + movu m2, [r3] + movu m3, [r3 + 32] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0], m0 + movu [r0 + 32], m1 + + movu m0, [r2 + 64] + movu m1, [r2 + 96] + movu m2, [r3 + 64] + movu m3, [r3 + 96] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + 64], m0 + movu [r0 + 96], m1 + + movu m0, [r2 + r4] + movu m1, [r2 + r4 + 32] + movu m2, [r3 + r5] + movu m3, [r3 + r5 + 32] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + + movu m0, [r2 + r4 + 64] + movu m1, [r2 + r4 + 96] + movu m2, [r3 + r5 + 64] + movu m3, [r3 + r5 + 96] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r1 + 64], m0 + movu [r0 + r1 + 96], m1 + + movu m0, [r2 + r4 * 2] + movu m1, [r2 + r4 * 2 + 
32] + movu m2, [r3 + r5 * 2] + movu m3, [r3 + r5 * 2+ 32] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + + movu m0, [r2 + r4 * 2 + 64] + movu m1, [r2 + r4 * 2 + 96] + movu m2, [r3 + r5 * 2 + 64] + movu m3, [r3 + r5 * 2 + 96] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r1 * 2 + 64], m0 + movu [r0 + r1 * 2 + 96], m1 + + movu m0, [r2 + r7] + movu m1, [r2 + r7 + 32] + movu m2, [r3 + r8] + movu m3, [r3 + r8 + 32] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r9], m0 + movu [r0 + r9 + 32], m1 + + movu m0, [r2 + r7 + 64] + movu m1, [r2 + r7 + 96] + movu m2, [r3 + r8 + 64] + movu m3, [r3 + r8 + 96] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r9 + 64], m0 + movu [r0 + r9 + 96], m1 + + dec r6d + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + jnz .loop + RET +%endif +%else +INIT_YMM avx2 +cglobal pixel_add_ps_64x64, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1 + mov r6d, 32 add r5, r5 .loop: pmovzxbw m0, [r2] ; first 16 of row 0 of src0 @@ -896,6 +1144,3 @@ RET %endif -%endmacro - -PIXEL_ADD_PS_W64_H2 64, 64
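The HIGH_BIT_DEPTH paths added above differ from the 8bpp ones mainly in the CLIPW/CLIPW2 step, which clamps each reconstructed sample against pw_pixel_max. A scalar sketch of that reconstruction, assuming a 10-bit build (PIXEL_MAX = 1023); illustrative only:

#include <cstdint>
#include <algorithm>

typedef uint16_t pixel;                      // HIGH_BIT_DEPTH build
static const int PIXEL_MAX = (1 << 10) - 1;  // 1023; matches pw_pixel_max at 10 bits

// Scalar model of the avx2 pixel_add_ps_WxH kernels above: add the
// int16 residual to the prediction and clamp to [0, PIXEL_MAX], which
// is exactly what CLIPW does with the pw_pixel_max constant.
template<int W, int H>
void pixel_add_ps_c(pixel* dst, intptr_t dstride,
                    const pixel* src0, const int16_t* src1,
                    intptr_t sstride0, intptr_t sstride1)
{
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
            dst[x] = (pixel)std::min(std::max(src0[x] + src1[x], 0), PIXEL_MAX);
        dst  += dstride;
        src0 += sstride0;
        src1 += sstride1;
    }
}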
x265_1.6.tar.gz/source/common/x86/sad-a.asm -> x265_1.7.tar.gz/source/common/x86/sad-a.asm
Changed
@@ -4004,10 +4004,12 @@ RET INIT_YMM avx2 -cglobal pixel_sad_32x24, 4,5,6 +cglobal pixel_sad_32x24, 4,7,6 xorps m0, m0 xorps m5, m5 mov r4d, 6 + lea r5, [r1 * 3] + lea r6, [r3 * 3] .loop movu m1, [r0] ; row 0 of pix0 movu m2, [r2] ; row 0 of pix1 @@ -4019,21 +4021,18 @@ paddd m0, m1 paddd m5, m3 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - - movu m1, [r0] ; row 2 of pix0 - movu m2, [r2] ; row 2 of pix1 - movu m3, [r0 + r1] ; row 3 of pix0 - movu m4, [r2 + r3] ; row 3 of pix1 + movu m1, [r0 + 2 * r1] ; row 2 of pix0 + movu m2, [r2 + 2 * r3] ; row 2 of pix1 + movu m3, [r0 + r5] ; row 3 of pix0 + movu m4, [r2 + r6] ; row 3 of pix1 psadbw m1, m2 psadbw m3, m4 paddd m0, m1 paddd m5, m3 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] dec r4d jnz .loop @@ -4307,10 +4306,12 @@ RET INIT_YMM avx2 -cglobal pixel_sad_64x48, 4,5,6 +cglobal pixel_sad_64x48, 4,7,6 xorps m0, m0 xorps m5, m5 - mov r4d, 24 + mov r4d, 12 + lea r5, [r1 * 3] + lea r6, [r3 * 3] .loop movu m1, [r0] ; first 32 of row 0 of pix0 movu m2, [r2] ; first 32 of row 0 of pix1 @@ -4332,8 +4333,28 @@ paddd m0, m1 paddd m5, m3 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] + movu m1, [r0 + 2 * r1] ; first 32 of row 0 of pix0 + movu m2, [r2 + 2 * r3] ; first 32 of row 0 of pix1 + movu m3, [r0 + 2 * r1 + 32] ; second 32 of row 0 of pix0 + movu m4, [r2 + 2 * r3 + 32] ; second 32 of row 0 of pix1 + + psadbw m1, m2 + psadbw m3, m4 + paddd m0, m1 + paddd m5, m3 + + movu m1, [r0 + r5] ; first 32 of row 1 of pix0 + movu m2, [r2 + r6] ; first 32 of row 1 of pix1 + movu m3, [r0 + 32 + r5] ; second 32 of row 1 of pix0 + movu m4, [r2 + 32 + r6] ; second 32 of row 1 of pix1 + + psadbw m1, m2 + psadbw m3, m4 + paddd m0, m1 + paddd m5, m3 + + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] dec r4d jnz .loop @@ -4347,10 +4368,12 @@ RET INIT_YMM avx2 -cglobal pixel_sad_64x64, 4,5,6 +cglobal pixel_sad_64x64, 4,7,6 xorps m0, m0 xorps m5, m5 mov r4d, 8 + lea r5, [r1 * 3] + lea r6, [r3 * 3] .loop movu m1, [r0] ; first 32 of row 0 of pix0 movu m2, [r2] ; first 32 of row 0 of pix1 @@ -4372,31 +4395,28 @@ paddd m0, m1 paddd m5, m3 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - - movu m1, [r0] ; first 32 of row 2 of pix0 - movu m2, [r2] ; first 32 of row 2 of pix1 - movu m3, [r0 + 32] ; second 32 of row 2 of pix0 - movu m4, [r2 + 32] ; second 32 of row 2 of pix1 + movu m1, [r0 + 2 * r1] ; first 32 of row 2 of pix0 + movu m2, [r2 + 2 * r3] ; first 32 of row 2 of pix1 + movu m3, [r0 + 2 * r1 + 32] ; second 32 of row 2 of pix0 + movu m4, [r2 + 2 * r3 + 32] ; second 32 of row 2 of pix1 psadbw m1, m2 psadbw m3, m4 paddd m0, m1 paddd m5, m3 - movu m1, [r0 + r1] ; first 32 of row 3 of pix0 - movu m2, [r2 + r3] ; first 32 of row 3 of pix1 - movu m3, [r0 + 32 + r1] ; second 32 of row 3 of pix0 - movu m4, [r2 + 32 + r3] ; second 32 of row 3 of pix1 + movu m1, [r0 + r5] ; first 32 of row 3 of pix0 + movu m2, [r2 + r6] ; first 32 of row 3 of pix1 + movu m3, [r0 + 32 + r5] ; second 32 of row 3 of pix0 + movu m4, [r2 + 32 + r6] ; second 32 of row 3 of pix1 psadbw m1, m2 psadbw m3, m4 paddd m0, m1 paddd m5, m3 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] movu m1, [r0] ; first 32 of row 4 of pix0 movu m2, [r2] ; first 32 of row 4 of pix1 @@ -4418,31 +4438,28 @@ paddd m0, m1 paddd m5, m3 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - - movu m1, [r0] ; first 32 of row 6 of pix0 - movu m2, [r2] ; first 32 of row 6 of pix1 - movu m3, [r0 + 32] ; second 32 of row 6 of pix0 - 
movu m4, [r2 + 32] ; second 32 of row 6 of pix1 + movu m1, [r0 + 2 * r1] ; first 32 of row 6 of pix0 + movu m2, [r2 + 2 * r3] ; first 32 of row 6 of pix1 + movu m3, [r0 + 2 * r1 + 32] ; second 32 of row 6 of pix0 + movu m4, [r2 + 2 * r3 + 32] ; second 32 of row 6 of pix1 psadbw m1, m2 psadbw m3, m4 paddd m0, m1 paddd m5, m3 - movu m1, [r0 + r1] ; first 32 of row 7 of pix0 - movu m2, [r2 + r3] ; first 32 of row 7 of pix1 - movu m3, [r0 + 32 + r1] ; second 32 of row 7 of pix0 - movu m4, [r2 + 32 + r3] ; second 32 of row 7 of pix1 + movu m1, [r0 + r5] ; first 32 of row 7 of pix0 + movu m2, [r2 + r6] ; first 32 of row 7 of pix1 + movu m3, [r0 + 32 + r5] ; second 32 of row 7 of pix0 + movu m4, [r2 + 32 + r6] ; second 32 of row 7 of pix1 psadbw m1, m2 psadbw m3, m4 paddd m0, m1 paddd m5, m3 - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] dec r4d jnz .loop
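The SAD rewrites above all follow one pattern: precompute 3*stride in r5/r6, address four rows from a single base pointer, and advance the pointers once per iteration instead of every two rows, halving the lea updates. The same schedule in scalar C++ (illustrative only):

#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel;

// Scalar model of the 4-rows-per-iteration pattern the avx2 SAD loops
// above use: rows 0..3 are addressed as base, base+stride, base+2*stride,
// base+3*stride, and the pointers advance by 4*stride once per iteration.
// Height is assumed to be a multiple of 4, as in the unrolled asm.
static int sad_WxH_ref(const pixel* pix0, intptr_t stride0,
                       const pixel* pix1, intptr_t stride1,
                       int width, int height)
{
    int sum = 0;
    for (int y = 0; y < height; y += 4)
    {
        for (int r = 0; r < 4; r++)
            for (int x = 0; x < width; x++)
                sum += abs(pix0[r * stride0 + x] - pix1[r * stride1 + x]);
        pix0 += 4 * stride0;   // one pointer update per 4 rows, like lea r0, [r0 + 4 * r1]
        pix1 += 4 * stride1;
    }
    return sum;
}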
x265_1.6.tar.gz/source/common/x86/sad16-a.asm -> x265_1.7.tar.gz/source/common/x86/sad16-a.asm
Changed
@@ -276,9 +276,8 @@ ABSW2 m3, m4, m3, m4, m7, m5 paddw m1, m2 paddw m3, m4 - paddw m3, m1 - pmaddwd m3, [pw_1] - paddd m0, m3 + paddw m0, m1 + paddw m0, m3 %else movu m1, [r2] movu m2, [r2+2*r3] @@ -287,15 +286,45 @@ ABSW2 m1, m2, m1, m2, m3, m4 lea r0, [r0+4*r1] lea r2, [r2+4*r3] - paddw m2, m1 - pmaddwd m2, [pw_1] - paddd m0, m2 + paddw m0, m1 + paddw m0, m2 %endif %endmacro -;----------------------------------------------------------------------------- -; int pixel_sad_NxM( uint16_t *, intptr_t, uint16_t *, intptr_t ) -;----------------------------------------------------------------------------- +%macro SAD_INC_2ROW_Nx64 1 +%if 2*%1 > mmsize + movu m1, [r2 + 0] + movu m2, [r2 + 16] + movu m3, [r2 + 2 * r3 + 0] + movu m4, [r2 + 2 * r3 + 16] + psubw m1, [r0 + 0] + psubw m2, [r0 + 16] + psubw m3, [r0 + 2 * r1 + 0] + psubw m4, [r0 + 2 * r1 + 16] + ABSW2 m1, m2, m1, m2, m5, m6 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + ABSW2 m3, m4, m3, m4, m7, m5 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m8, m3 +%else + movu m1, [r2] + movu m2, [r2 + 2 * r3] + psubw m1, [r0] + psubw m2, [r0 + 2 * r1] + ABSW2 m1, m2, m1, m2, m3, m4 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + paddw m0, m1 + paddw m8, m2 +%endif +%endmacro + +; ---------------------------------------------------------------------------- - +; int pixel_sad_NxM(uint16_t *, intptr_t, uint16_t *, intptr_t) +; ---------------------------------------------------------------------------- - %macro SAD 2 cglobal pixel_sad_%1x%2, 4,5-(%2&4/4),8*(%1/mmsize) pxor m0, m0 @@ -309,8 +338,35 @@ dec r4d jg .loop %endif +%if %2 == 32 + HADDUWD m0, m1 + HADDD m0, m1 +%else + HADDW m0, m1 +%endif + movd eax, xm0 + RET +%endmacro +; ---------------------------------------------------------------------------- - +; int pixel_sad_Nx64(uint16_t *, intptr_t, uint16_t *, intptr_t) +; ---------------------------------------------------------------------------- - +%macro SAD_Nx64 1 +cglobal pixel_sad_%1x64, 4,5-(64&4/4), 9 + pxor m0, m0 + pxor m8, m8 + mov r4d, 64 / 2 +.loop: + SAD_INC_2ROW_Nx64 %1 + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDUWD m8, m1 HADDD m0, m1 + HADDD m8, m1 + paddd m0, m8 + movd eax, xm0 RET %endmacro @@ -321,7 +377,7 @@ SAD 16, 12 SAD 16, 16 SAD 16, 32 -SAD 16, 64 +SAD_Nx64 16 INIT_XMM sse2 SAD 8, 4 @@ -329,6 +385,13 @@ SAD 8, 16 SAD 8, 32 +INIT_YMM avx2 +SAD 16, 4 +SAD 16, 8 +SAD 16, 12 +SAD 16, 16 +SAD 16, 32 + ;------------------------------------------------------------------ ; int pixel_sad_32xN( uint16_t *, intptr_t, uint16_t *, intptr_t ) ;------------------------------------------------------------------ @@ -716,7 +779,6 @@ %endif movd eax, xm0 RET - ;----------------------------------------------------------------------------- ; void pixel_sad_xN_WxH( uint16_t *fenc, uint16_t *pix0, uint16_t *pix1, ; uint16_t *pix2, intptr_t i_stride, int scores[3] )
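The new SAD_Nx64 path exists because of accumulator width: these are 16-bit pixels and the loop accumulates absolute differences with paddw. In the 16-wide case each word lane absorbs two folded column differences per row, so pushing all 64 rows through one accumulator could reach roughly 128 * 1023 at 10-bit depth and wrap; splitting alternate rows across m0 and m8 caps each lane at 64 * 1023 = 65472, which fits an unsigned word (but not a signed one, hence the unsigned widen HADDUWD before the final 32-bit reduction). A back-of-envelope check, assuming a 10-bit build:

#include <cstdio>

// Back-of-envelope check for the two-accumulator change above: each
// word lane of the 16-wide loop absorbs two column differences per row
// because the halves are folded together with paddw.
int main()
{
    const int PIXEL_MAX = 1023;                 // 10-bit pixels
    long oneAccumulator = 64L * 2 * PIXEL_MAX;  // all 64 rows into one lane
    long perSplitAcc    = 32L * 2 * PIXEL_MAX;  // rows split across m0/m8
    printf("single accumulator: %ld (uint16 max 65535)\n", oneAccumulator); // 130944: wraps
    printf("per split accumulator: %ld\n", perSplitAcc);                    // 65472: fits
    return 0;
}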
x265_1.6.tar.gz/source/common/x86/x86inc.asm -> x265_1.7.tar.gz/source/common/x86/x86inc.asm
Changed
@@ -72,7 +72,7 @@ %define mangle(x) x %endif -%macro SECTION_RODATA 0-1 16 +%macro SECTION_RODATA 0-1 32 SECTION .rodata align=%1 %endmacro @@ -715,6 +715,7 @@ %else global %1 %endif + ALIGN 32 %1: %2 %endmacro
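Raising the default SECTION_RODATA alignment from 16 to 32 bytes, plus the ALIGN 32 before each named constant, is what lets the new avx2 kernels read ymm-wide constants with aligned loads; 16-byte alignment was only enough for xmm. A C++ analogue of the same idea, with values shown for an assumed 10-bit build (the real table lives in the asm constants file):

#include <cstdint>

// C++ analogue of the change: AVX2 kernels load 32-byte (ymm-wide)
// constants, so read-only tables must be 32-byte aligned for aligned
// accesses (vmovdqa / mova) to be legal.
alignas(32) static const uint16_t pw_pixel_max[16] = {
    1023, 1023, 1023, 1023, 1023, 1023, 1023, 1023,
    1023, 1023, 1023, 1023, 1023, 1023, 1023, 1023,
};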
x265_1.6.tar.gz/source/encoder/CMakeLists.txt -> x265_1.7.tar.gz/source/encoder/CMakeLists.txt
Changed
@@ -1,7 +1,11 @@ # vim: syntax=cmake if(GCC) - add_definitions(-Wno-uninitialized) + add_definitions(-Wno-uninitialized) + if(CC_HAS_NO_STRICT_OVERFLOW) + # GCC 4.9.2 gives warnings we know we can ignore in this file + set_source_files_properties(slicetype.cpp PROPERTIES COMPILE_FLAGS -Wno-strict-overflow) + endif(CC_HAS_NO_STRICT_OVERFLOW) endif() if(MSVC) add_definitions(/wd4701) # potentially uninitialized local variable 'foo' used
x265_1.6.tar.gz/source/encoder/analysis.cpp -> x265_1.7.tar.gz/source/encoder/analysis.cpp
Changed
@@ -130,9 +130,12 @@ for (uint32_t i = 0; i <= g_maxCUDepth; i++) for (uint32_t j = 0; j < MAX_PRED_TYPES; j++) m_modeDepth[i].pred[j].invalidate(); -#endif invalidateContexts(0); - m_quant.setQPforQuant(ctu); +#endif + + int qp = setLambdaFromQP(ctu, m_slice->m_pps->bUseDQP ? calculateQpforCuSize(ctu, cuGeom) : m_slice->m_sliceQp); + ctu.setQPSubParts((int8_t)qp, 0, 0); + m_rqt[0].cur.load(initialContext); m_modeDepth[0].fencYuv.copyFromPicYuv(*m_frame->m_fencPic, ctu.m_cuAddr, 0); @@ -140,11 +143,11 @@ if (m_param->analysisMode) { if (m_slice->m_sliceType == I_SLICE) - m_reuseIntraDataCTU = (analysis_intra_data *)m_frame->m_analysisData.intraData; + m_reuseIntraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData; else { int numPredDir = m_slice->isInterP() ? 1 : 2; - m_reuseInterDataCTU = (analysis_inter_data *)m_frame->m_analysisData.interData; + m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS]; } @@ -155,10 +158,10 @@ uint32_t zOrder = 0; if (m_slice->m_sliceType == I_SLICE) { - compressIntraCU(ctu, cuGeom, zOrder); + compressIntraCU(ctu, cuGeom, zOrder, qp); if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.intraData) { - CUData *bestCU = &m_modeDepth[0].bestMode->cu; + CUData* bestCU = &m_modeDepth[0].bestMode->cu; memcpy(&m_reuseIntraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition); memcpy(&m_reuseIntraDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition); memcpy(&m_reuseIntraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition); @@ -173,21 +176,21 @@ * they are available for intra predictions */ m_modeDepth[0].fencYuv.copyToPicYuv(*m_frame->m_reconPic, ctu.m_cuAddr, 0); - compressInterCU_rd0_4(ctu, cuGeom); + compressInterCU_rd0_4(ctu, cuGeom, qp); /* generate residual for entire CTU at once and copy to reconPic */ encodeResidue(ctu, cuGeom); } else if (m_param->bDistributeModeAnalysis && m_param->rdLevel >= 2) - compressInterCU_dist(ctu, cuGeom); + compressInterCU_dist(ctu, cuGeom, qp); else if (m_param->rdLevel <= 4) - compressInterCU_rd0_4(ctu, cuGeom); + compressInterCU_rd0_4(ctu, cuGeom, qp); else { - compressInterCU_rd5_6(ctu, cuGeom, zOrder); + compressInterCU_rd5_6(ctu, cuGeom, zOrder, qp); if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.interData) { - CUData *bestCU = &m_modeDepth[0].bestMode->cu; + CUData* bestCU = &m_modeDepth[0].bestMode->cu; memcpy(&m_reuseInterDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition); memcpy(&m_reuseInterDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_predMode, sizeof(uint8_t) * numPartition); } @@ -206,24 +209,28 @@ return; else if (md.bestMode->cu.isIntra(0)) { + m_quant.m_tqBypass = true; md.pred[PRED_LOSSLESS].initCosts(); md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom); PartSize size = (PartSize)md.pred[PRED_LOSSLESS].cu.m_partSize[0]; uint8_t* modes = md.pred[PRED_LOSSLESS].cu.m_lumaIntraDir; checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes, NULL); checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth); + m_quant.m_tqBypass = false; } else { + m_quant.m_tqBypass = true; md.pred[PRED_LOSSLESS].initCosts(); 
md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom); md.pred[PRED_LOSSLESS].predYuv.copyFromYuv(md.bestMode->predYuv); encodeResAndCalcRdInterCU(md.pred[PRED_LOSSLESS], cuGeom); checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth); + m_quant.m_tqBypass = false; } } -void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder) +void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder, int32_t qp) { uint32_t depth = cuGeom.depth; ModeDepth& md = m_modeDepth[depth]; @@ -241,11 +248,9 @@ if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx) { - m_quant.setQPforQuant(parentCTU); - PartSize size = (PartSize)reusePartSizes[zOrder]; Mode& mode = size == SIZE_2Nx2N ? md.pred[PRED_INTRA] : md.pred[PRED_INTRA_NxN]; - mode.cu.initSubCU(parentCTU, cuGeom); + mode.cu.initSubCU(parentCTU, cuGeom, qp); checkIntra(mode, cuGeom, size, &reuseModes[zOrder], &reuseChromaModes[zOrder]); checkBestMode(mode, depth); @@ -262,15 +267,13 @@ } else if (mightNotSplit) { - m_quant.setQPforQuant(parentCTU); - - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL); checkBestMode(md.pred[PRED_INTRA], depth); if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3) { - md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp); checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL); checkBestMode(md.pred[PRED_INTRA_NxN], depth); } @@ -287,12 +290,13 @@ Mode* splitPred = &md.pred[PRED_SPLIT]; splitPred->initCosts(); CUData* splitCU = &splitPred->cu; - splitCU->initSubCU(parentCTU, cuGeom); + splitCU->initSubCU(parentCTU, cuGeom, qp); uint32_t nextDepth = depth + 1; ModeDepth& nd = m_modeDepth[nextDepth]; invalidateContexts(nextDepth); Entropy* nextContext = &m_rqt[depth].cur; + int32_t nextQP = qp; for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) { @@ -301,7 +305,11 @@ { m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); m_rqt[nextDepth].cur.load(*nextContext); - compressIntraCU(parentCTU, childGeom, zOrder); + + if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) + nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); + + compressIntraCU(parentCTU, childGeom, zOrder, nextQP); // Save best CU and pred data for this sub CU splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); @@ -322,7 +330,7 @@ else updateModeCost(*splitPred); - checkDQPForSplitPred(splitPred->cu, cuGeom); + checkDQPForSplitPred(*splitPred, cuGeom); checkBestMode(*splitPred, depth); } @@ -362,24 +370,18 @@ } ModeDepth& md = m_modeDepth[pmode.cuGeom.depth]; - bool bMergeOnly = pmode.cuGeom.log2CUSize == 6; /* setup slave Analysis */ if (&slave != this) { slave.m_slice = m_slice; slave.m_frame = m_frame; - slave.setQP(*m_slice, m_rdCost.m_qp); + slave.m_param = m_param; + slave.setLambdaFromQP(md.pred[PRED_2Nx2N].cu, m_rdCost.m_qp); slave.invalidateContexts(0); - - if (m_param->rdLevel >= 5) - { - slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur); - slave.m_quant.setQPforQuant(md.pred[PRED_2Nx2N].cu); - } + slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur); } - /* perform Mode task, repeat until no more work is available */ do { @@ -388,8 +390,6 @@ switch (pmode.modes[task]) { case PRED_INTRA: - if (&slave != 
this) - slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur); slave.checkIntraInInter(md.pred[PRED_INTRA], pmode.cuGeom); if (m_param->rdLevel > 2) slave.encodeIntraInInter(md.pred[PRED_INTRA], pmode.cuGeom); @@ -441,7 +441,7 @@ break; case PRED_2Nx2N: - slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N, false); + slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N); md.pred[PRED_BIDIR].rdCost = MAX_INT64; if (m_slice->m_sliceType == B_SLICE) { @@ -452,27 +452,27 @@ break; case PRED_Nx2N: - slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N, false); + slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N); break; case PRED_2NxN: - slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN, false); + slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN); break; case PRED_2NxnU: - slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU, bMergeOnly); + slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU); break; case PRED_2NxnD: - slave.checkInter_rd5_6(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD, bMergeOnly); + slave.checkInter_rd5_6(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD); break; case PRED_nLx2N: - slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N, bMergeOnly); + slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N); break; case PRED_nRx2N: - slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N, bMergeOnly); + slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N); break; default: @@ -490,7 +490,7 @@ while (task >= 0); } -void Analysis::compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom) +void Analysis::compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp) { uint32_t depth = cuGeom.depth; uint32_t cuAddr = parentCTU.m_cuAddr; @@ -505,34 +505,34 @@ if (mightNotSplit && depth >= minDepth) { - int bTryAmp = m_slice->m_sps->maxAMPDepth > depth && (cuGeom.log2CUSize < 6 || m_param->rdLevel > 4); + int bTryAmp = m_slice->m_sps->maxAMPDepth > depth; int bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames; PMODE pmode(*this, cuGeom); /* Initialize all prediction CUs based on parentCTU */ - md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom); - md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); + md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); if (bTryIntra) { - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3 && m_param->rdLevel >= 5) - md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_INTRA; } - md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2Nx2N; - md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2Nx2N; + md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp); if (m_param->bEnableRectInter) { - md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxN; - md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_Nx2N; + md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = 
PRED_2NxN; + md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_Nx2N; } if (bTryAmp) { - md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnU; - md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnD; - md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_nLx2N; - md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_nRx2N; + md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnU; + md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnD; + md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_nLx2N; + md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_nRx2N; } pmode.tryBondPeers(*m_frame->m_encData->m_jobProvider, pmode.m_jobTotal); @@ -662,7 +662,7 @@ if (md.bestMode->rdCost == MAX_INT64 && !bTryIntra) { - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); checkIntraInInter(md.pred[PRED_INTRA], cuGeom); encodeIntraInInter(md.pred[PRED_INTRA], cuGeom); checkBestMode(md.pred[PRED_INTRA], depth); @@ -688,12 +688,13 @@ Mode* splitPred = &md.pred[PRED_SPLIT]; splitPred->initCosts(); CUData* splitCU = &splitPred->cu; - splitCU->initSubCU(parentCTU, cuGeom); + splitCU->initSubCU(parentCTU, cuGeom, qp); uint32_t nextDepth = depth + 1; ModeDepth& nd = m_modeDepth[nextDepth]; invalidateContexts(nextDepth); Entropy* nextContext = &m_rqt[depth].cur; + int nextQP = qp; for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) { @@ -702,7 +703,11 @@ { m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); m_rqt[nextDepth].cur.load(*nextContext); - compressInterCU_dist(parentCTU, childGeom); + + if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) + nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); + + compressInterCU_dist(parentCTU, childGeom, nextQP); // Save best CU and pred data for this sub CU splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); @@ -721,7 +726,7 @@ else updateModeCost(*splitPred); - checkDQPForSplitPred(splitPred->cu, cuGeom); + checkDQPForSplitPred(*splitPred, cuGeom); checkBestMode(*splitPred, depth); } @@ -741,7 +746,7 @@ md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx); } -void Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom) +void Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp) { uint32_t depth = cuGeom.depth; uint32_t cuAddr = parentCTU.m_cuAddr; @@ -757,8 +762,8 @@ bool bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames; /* Compute Merge Cost */ - md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom); - md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); + md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); bool earlyskip = false; @@ -767,30 +772,30 @@ if (!earlyskip) { - md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); checkInter_rd0_4(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N); if (m_slice->m_sliceType == B_SLICE) { - 
md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp); checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], cuGeom); } Mode *bestInter = &md.pred[PRED_2Nx2N]; if (m_param->bEnableRectInter) { - md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N); if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_Nx2N]; - md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN); if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_2NxN]; } - if (m_slice->m_sps->maxAMPDepth > depth && cuGeom.log2CUSize < 6) + if (m_slice->m_sps->maxAMPDepth > depth) { bool bHor = false, bVer = false; if (bestInter->cu.m_partSize[0] == SIZE_2NxN) @@ -806,24 +811,24 @@ if (bHor) { - md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU); if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_2NxnU]; - md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD); if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_2NxnD]; } if (bVer) { - md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N); if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_nLx2N]; - md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N); if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_nRx2N]; @@ -855,7 +860,7 @@ if ((bTryIntra && md.bestMode->cu.getQtRootCbf(0)) || md.bestMode->sa8dCost == MAX_INT64) { - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); checkIntraInInter(md.pred[PRED_INTRA], cuGeom); encodeIntraInInter(md.pred[PRED_INTRA], cuGeom); checkBestMode(md.pred[PRED_INTRA], depth); @@ -873,7 +878,7 @@ if (bTryIntra || md.bestMode->sa8dCost == MAX_INT64) { - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); checkIntraInInter(md.pred[PRED_INTRA], cuGeom); if (md.pred[PRED_INTRA].sa8dCost < md.bestMode->sa8dCost) md.bestMode = &md.pred[PRED_INTRA]; @@ -901,7 +906,6 @@ { /* generate recon pixels with no rate distortion considerations */ CUData& cu = md.bestMode->cu; - m_quant.setQPforQuant(cu); uint32_t tuDepthRange[2]; cu.getInterTUQtDepthRange(tuDepthRange, 0); @@ -926,7 +930,6 @@ { /* generate recon pixels with no rate distortion considerations */ CUData& cu = md.bestMode->cu; - m_quant.setQPforQuant(cu); uint32_t tuDepthRange[2]; cu.getIntraTUQtDepthRange(tuDepthRange, 0); @@ -960,12 +963,13 @@ Mode* splitPred = &md.pred[PRED_SPLIT]; splitPred->initCosts(); CUData* splitCU = &splitPred->cu; - splitCU->initSubCU(parentCTU, cuGeom); + splitCU->initSubCU(parentCTU, cuGeom, qp); uint32_t nextDepth = depth + 1; ModeDepth& nd = m_modeDepth[nextDepth]; invalidateContexts(nextDepth); Entropy* nextContext = &m_rqt[depth].cur; + int 
nextQP = qp; for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) { @@ -974,7 +978,11 @@ { m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); m_rqt[nextDepth].cur.load(*nextContext); - compressInterCU_rd0_4(parentCTU, childGeom); + + if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) + nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); + + compressInterCU_rd0_4(parentCTU, childGeom, nextQP); // Save best CU and pred data for this sub CU splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); @@ -1006,7 +1014,7 @@ else if (splitPred->sa8dCost < md.bestMode->sa8dCost) md.bestMode = splitPred; - checkDQPForSplitPred(md.bestMode->cu, cuGeom); + checkDQPForSplitPred(*md.bestMode, cuGeom); } if (mightNotSplit) { @@ -1025,7 +1033,7 @@ md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx); } -void Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder) +void Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp) { uint32_t depth = cuGeom.depth; ModeDepth& md = m_modeDepth[depth]; @@ -1040,8 +1048,8 @@ uint8_t* reuseModes = &m_reuseInterDataCTU->modes[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx && reuseModes[zOrder] == MODE_SKIP) { - md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom); - md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); + md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, true); if (m_bTryLossless) @@ -1060,20 +1068,20 @@ if (mightNotSplit) { - md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom); - md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); + md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, false); bool earlySkip = m_param->bEnableEarlySkip && md.bestMode && !md.bestMode->cu.getQtRootCbf(0); if (!earlySkip) { - md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom); - checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, false); + md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N); checkBestMode(md.pred[PRED_2Nx2N], cuGeom.depth); if (m_slice->m_sliceType == B_SLICE) { - md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp); checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], cuGeom); if (md.pred[PRED_BIDIR].sa8dCost < MAX_INT64) { @@ -1084,20 +1092,18 @@ if (m_param->bEnableRectInter) { - md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom); - checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, false); + md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N); checkBestMode(md.pred[PRED_Nx2N], cuGeom.depth); - md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom); - checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, false); + md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN); checkBestMode(md.pred[PRED_2NxN], cuGeom.depth); } // Try AMP (SIZE_2NxnU, SIZE_2NxnD, SIZE_nLx2N, SIZE_nRx2N) if (m_slice->m_sps->maxAMPDepth > depth) { - bool bMergeOnly = 
cuGeom.log2CUSize == 6; - bool bHor = false, bVer = false; if (md.bestMode->cu.m_partSize[0] == SIZE_2NxN) bHor = true; @@ -1111,35 +1117,35 @@ if (bHor) { - md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom); - checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, bMergeOnly); + md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU); checkBestMode(md.pred[PRED_2NxnU], cuGeom.depth); - md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom); - checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, bMergeOnly); + md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD); checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth); } if (bVer) { - md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom); - checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, bMergeOnly); + md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N); checkBestMode(md.pred[PRED_nLx2N], cuGeom.depth); - md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom); - checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, bMergeOnly); + md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N); checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth); } } if (m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames) { - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL); checkBestMode(md.pred[PRED_INTRA], depth); if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3) { - md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom); + md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp); checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL); checkBestMode(md.pred[PRED_INTRA_NxN], depth); } @@ -1159,12 +1165,13 @@ Mode* splitPred = &md.pred[PRED_SPLIT]; splitPred->initCosts(); CUData* splitCU = &splitPred->cu; - splitCU->initSubCU(parentCTU, cuGeom); + splitCU->initSubCU(parentCTU, cuGeom, qp); uint32_t nextDepth = depth + 1; ModeDepth& nd = m_modeDepth[nextDepth]; invalidateContexts(nextDepth); Entropy* nextContext = &m_rqt[depth].cur; + int nextQP = qp; for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) { @@ -1173,7 +1180,11 @@ { m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); m_rqt[nextDepth].cur.load(*nextContext); - compressInterCU_rd5_6(parentCTU, childGeom, zOrder); + + if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) + nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); + + compressInterCU_rd5_6(parentCTU, childGeom, zOrder, nextQP); // Save best CU and pred data for this sub CU splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); @@ -1193,7 +1204,7 @@ else updateModeCost(*splitPred); - checkDQPForSplitPred(splitPred->cu, cuGeom); + checkDQPForSplitPred(*splitPred, cuGeom); checkBestMode(*splitPred, depth); } @@ -1308,7 +1319,7 @@ md.bestMode->cu.setPUMv(1, candMvField[bestSadCand][1].mv, 0, 0); md.bestMode->cu.setPURefIdx(0, (int8_t)candMvField[bestSadCand][0].refIdx, 0, 0); md.bestMode->cu.setPURefIdx(1, (int8_t)candMvField[bestSadCand][1].refIdx, 0, 0); - checkDQP(md.bestMode->cu, cuGeom); + checkDQP(*md.bestMode, cuGeom); X265_CHECK(md.bestMode->ok(), "Merge mode not ok\n"); } @@ -1440,7 +1451,7 @@ bestPred->cu.setPUMv(1, 
candMvField[bestCand][1].mv, 0, 0); bestPred->cu.setPURefIdx(0, (int8_t)candMvField[bestCand][0].refIdx, 0, 0); bestPred->cu.setPURefIdx(1, (int8_t)candMvField[bestCand][1].refIdx, 0, 0); - checkDQP(bestPred->cu, cuGeom); + checkDQP(*bestPred, cuGeom); X265_CHECK(bestPred->ok(), "merge mode is not ok"); } @@ -1472,7 +1483,7 @@ } } - predInterSearch(interMode, cuGeom, false, m_bChromaSa8d); + predInterSearch(interMode, cuGeom, m_bChromaSa8d); /* predInterSearch sets interMode.sa8dBits */ const Yuv& fencYuv = *interMode.fencYuv; @@ -1500,7 +1511,7 @@ } } -void Analysis::checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, bool bMergeOnly) +void Analysis::checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize) { interMode.initCosts(); interMode.cu.setPartSizeSubParts(partSize); @@ -1520,7 +1531,7 @@ } } - predInterSearch(interMode, cuGeom, bMergeOnly, true); + predInterSearch(interMode, cuGeom, true); /* predInterSearch sets interMode.sa8dBits, but this is ignored */ encodeResAndCalcRdInterCU(interMode, cuGeom); @@ -1642,8 +1653,8 @@ uint32_t zcost = zsa8d + m_rdCost.getCost(bits0) + m_rdCost.getCost(bits1); /* refine MVP selection for zero mv, updates: mvp, mvpidx, bits, cost */ - checkBestMVP(inter2Nx2N.amvpCand[0][ref0], mvzero, mvp0, mvpIdx0, bits0, zcost); - checkBestMVP(inter2Nx2N.amvpCand[1][ref1], mvzero, mvp1, mvpIdx1, bits1, zcost); + mvp0 = checkBestMVP(inter2Nx2N.amvpCand[0][ref0], mvzero, mvpIdx0, bits0, zcost); + mvp1 = checkBestMVP(inter2Nx2N.amvpCand[1][ref1], mvzero, mvpIdx1, bits1, zcost); uint32_t zbits = bits0 + bits1 + m_listSelBits[2] - (m_listSelBits[0] + m_listSelBits[1]); zcost = zsa8d + m_rdCost.getCost(zbits); @@ -1697,7 +1708,6 @@ CUData& cu = bestMode->cu; cu.copyFromPic(ctu, cuGeom); - m_quant.setQPforQuant(cu); Yuv& fencYuv = m_modeDepth[cuGeom.depth].fencYuv; if (cuGeom.depth) @@ -1913,37 +1923,39 @@ return false; } -int Analysis::calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom) +int Analysis::calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom) { - uint32_t ctuAddr = ctu.m_cuAddr; FrameData& curEncData = *m_frame->m_encData; - double qp = curEncData.m_cuStat[ctuAddr].baseQp; - - uint32_t width = m_frame->m_fencPic->m_picWidth; - uint32_t height = m_frame->m_fencPic->m_picHeight; - uint32_t block_x = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx]; - uint32_t block_y = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx]; - uint32_t maxCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16; - uint32_t blockSize = g_maxCUSize >> cuGeom.depth; - double qp_offset = 0; - uint32_t cnt = 0; - uint32_t idx; + double qp = curEncData.m_cuStat[ctu.m_cuAddr].baseQp; /* Use cuTree offsets if cuTree enabled and frame is referenced, else use AQ offsets */ bool isReferenced = IS_REFERENCED(m_frame); double *qpoffs = (isReferenced && m_param->rc.cuTree) ? 
m_frame->m_lowres.qpCuTreeOffset : m_frame->m_lowres.qpAqOffset; - - for (uint32_t block_yy = block_y; block_yy < block_y + blockSize && block_yy < height; block_yy += 16) + if (qpoffs) { - for (uint32_t block_xx = block_x; block_xx < block_x + blockSize && block_xx < width; block_xx += 16) + uint32_t width = m_frame->m_fencPic->m_picWidth; + uint32_t height = m_frame->m_fencPic->m_picHeight; + uint32_t block_x = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx]; + uint32_t block_y = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx]; + uint32_t maxCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16; + uint32_t blockSize = g_maxCUSize >> cuGeom.depth; + double qp_offset = 0; + uint32_t cnt = 0; + uint32_t idx; + + for (uint32_t block_yy = block_y; block_yy < block_y + blockSize && block_yy < height; block_yy += 16) { - idx = ((block_yy / 16) * (maxCols)) + (block_xx / 16); - qp_offset += qpoffs[idx]; - cnt++; + for (uint32_t block_xx = block_x; block_xx < block_x + blockSize && block_xx < width; block_xx += 16) + { + idx = ((block_yy / 16) * (maxCols)) + (block_xx / 16); + qp_offset += qpoffs[idx]; + cnt++; + } } + + qp_offset /= cnt; + qp += qp_offset; } - qp_offset /= cnt; - qp += qp_offset; return x265_clip3(QP_MIN, QP_MAX_MAX, (int)(qp + 0.5)); }
x265_1.6.tar.gz/source/encoder/analysis.h -> x265_1.7.tar.gz/source/encoder/analysis.h
Changed
@@ -109,12 +109,12 @@ uint32_t* m_reuseBestMergeCand; /* full analysis for an I-slice CU */ - void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder); + void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp); /* full analysis for a P or B slice CU */ - void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom); - void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom); - void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder); + void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp); + void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp); + void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp); /* measure merge and skip */ void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom); @@ -122,7 +122,7 @@ /* measure inter options */ void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize); - void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, bool bMergeOnly); + void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize); void checkBidir2Nx2N(Mode& inter2Nx2N, Mode& bidir2Nx2N, const CUGeom& cuGeom); @@ -139,7 +139,7 @@ /* generate residual and recon pixels for an entire CTU recursively (RD0) */ void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom); - int calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom); + int calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom); /* check whether current mode is the new best */ inline void checkBestMode(Mode& mode, uint32_t depth)
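The signature changes above thread a quantization-group QP down the recursive analysis. A minimal schematic (not x265 code) of the recursion shape, assuming maxCuDQPDepth has already been derived from --qg-size:

#include <cstdint>

// Minimal schematic of the new qp threading: the QP is re-derived for a
// child CU only while the child depth is still inside the quantization
// group depth; deeper splits inherit it unchanged.
struct Geom { uint32_t depth; };

static const uint32_t maxCuDQPDepth = 2;   // assumed: derived from --qg-size
static const uint32_t maxDepth = 3;
static bool bUseDQP = true;

static int qpForCuSize(const Geom&) { return 30; }  // stand-in for the real derivation

static void compressCU(const Geom& g, int qp)
{
    // ... all mode decisions at this depth use 'qp' ...
    if (g.depth >= maxDepth)
        return;
    for (int i = 0; i < 4; i++)
    {
        Geom child = { g.depth + 1 };
        int nextQP = qp;
        if (bUseDQP && child.depth <= maxCuDQPDepth)
            nextQP = qpForCuSize(child);
        compressCU(child, nextQP);
    }
}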
x265_1.6.tar.gz/source/encoder/api.cpp -> x265_1.7.tar.gz/source/encoder/api.cpp
Changed
@@ -39,9 +39,11 @@ if (!p) return NULL; - x265_param *param = X265_MALLOC(x265_param, 1); - if (!param) - return NULL; + Encoder* encoder = NULL; + x265_param* param = x265_param_alloc(); + x265_param* latestParam = x265_param_alloc(); + if (!param || !latestParam) + goto fail; memcpy(param, p, sizeof(x265_param)); x265_log(param, X265_LOG_INFO, "HEVC encoder version %s\n", x265_version_str); @@ -50,38 +52,44 @@ x265_setup_primitives(param, param->cpuid); if (x265_check_params(param)) - return NULL; + goto fail; if (x265_set_globals(param)) - return NULL; + goto fail; - Encoder *encoder = new Encoder; + encoder = new Encoder; if (!param->rc.bEnableSlowFirstPass) x265_param_apply_fastfirstpass(param); // may change params for auto-detect, etc encoder->configure(param); - // may change rate control and CPB params if (!enforceLevel(*param, encoder->m_vps)) - { - delete encoder; - return NULL; - } + goto fail; // will detect and set profile/tier/level in VPS determineLevel(*param, encoder->m_vps); - encoder->create(); - if (encoder->m_aborted) + if (!param->bAllowNonConformance && encoder->m_vps.ptl.profileIdc == Profile::NONE) { - delete encoder; - return NULL; + x265_log(param, X265_LOG_INFO, "non-conformant bitstreams not allowed (--allow-non-conformance)\n"); + goto fail; } - x265_print_params(param); + encoder->create(); + encoder->m_latestParam = latestParam; + memcpy(latestParam, param, sizeof(x265_param)); + if (encoder->m_aborted) + goto fail; + x265_print_params(param); return encoder; + +fail: + delete encoder; + x265_param_free(param); + x265_param_free(latestParam); + return NULL; } extern "C" @@ -112,6 +120,27 @@ } extern "C" +int x265_encoder_reconfig(x265_encoder* enc, x265_param* param_in) +{ + if (!enc || !param_in) + return -1; + + x265_param save; + Encoder* encoder = static_cast<Encoder*>(enc); + memcpy(&save, encoder->m_latestParam, sizeof(x265_param)); + int ret = encoder->reconfigureParam(encoder->m_latestParam, param_in); + if (ret) + /* reconfigure failed, recover saved param set */ + memcpy(encoder->m_latestParam, &save, sizeof(x265_param)); + else + { + encoder->m_reconfigured = true; + x265_print_reconfigured_params(&save, encoder->m_latestParam); + } + return ret; +} + +extern "C" int x265_encoder_encode(x265_encoder *enc, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out) { if (!enc) @@ -173,19 +202,22 @@ { Encoder *encoder = static_cast<Encoder*>(enc); - encoder->stop(); + encoder->stopJobs(); encoder->printSummary(); encoder->destroy(); delete encoder; + ATOMIC_DEC(&g_ctuSizeConfigured); } } extern "C" void x265_cleanup(void) { - BitCost::destroy(); - CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */ - g_ctuSizeConfigured = 0; + if (!g_ctuSizeConfigured) + { + BitCost::destroy(); + CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */ + } } extern "C" @@ -232,6 +264,7 @@ &x265_picture_init, &x265_encoder_open, &x265_encoder_parameters, + &x265_encoder_reconfig, &x265_encoder_headers, &x265_encoder_encode, &x265_encoder_get_stats, @@ -243,11 +276,66 @@ x265_max_bit_depth, }; +typedef const x265_api* (*api_get_func)(int bitDepth); + +#define xstr(s) str(s) +#define str(s) #s + +#if _WIN32 +#define ext ".dll" +#elif MACOS +#include <dlfcn.h> +#define ext ".dylib" +#else +#include <dlfcn.h> +#define ext ".so" +#endif + extern "C" const x265_api* x265_api_get(int bitDepth) { if (bitDepth && bitDepth != X265_DEPTH) - return NULL; + { + const char* libname = NULL; + const char* method = 
"x265_api_get_" xstr(X265_BUILD); + + if (bitDepth == 12) + libname = "libx265_main12" ext; + else if (bitDepth == 10) + libname = "libx265_main10" ext; + else if (bitDepth == 8) + libname = "libx265_main" ext; + else + return NULL; + + const x265_api* api = NULL; + +#if _WIN32 + HMODULE h = LoadLibraryA(libname); + if (h) + { + api_get_func get = (api_get_func)GetProcAddress(h, method); + if (get) + api = get(0); + } +#else + void* h = dlopen(libname, RTLD_LAZY | RTLD_LOCAL); + if (h) + { + api_get_func get = (api_get_func)dlsym(h, method); + if (get) + api = get(0); + } +#endif + + if (api && bitDepth != api->max_bit_depth) + { + x265_log(NULL, X265_LOG_WARNING, "%s does not support requested bitDepth %d\n", libname, bitDepth); + return NULL; + } + + return api; + } return &libapi; }
x265_1.6.tar.gz/source/encoder/encoder.cpp -> x265_1.7.tar.gz/source/encoder/encoder.cpp
Changed
@@ -58,6 +58,7 @@ Encoder::Encoder() { m_aborted = false; + m_reconfigured = false; m_encodedFrameNum = 0; m_pocLast = -1; m_curEncoder = 0; @@ -73,6 +74,7 @@ m_outputCount = 0; m_csvfpt = NULL; m_param = NULL; + m_latestParam = NULL; m_cuOffsetY = NULL; m_cuOffsetC = NULL; m_buOffsetY = NULL; @@ -106,7 +108,7 @@ bool allowPools = !p->numaPools || strcmp(p->numaPools, "none"); // Trim the thread pool if --wpp, --pme, and --pmode are disabled - if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation) + if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation && !p->lookaheadSlices) allowPools = false; if (!p->frameNumThreads) @@ -140,9 +142,11 @@ x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pme disabled\n"); if (p->bDistributeModeAnalysis) x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pmode disabled\n"); + if (p->lookaheadSlices) + x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --lookahead-slices disabled\n"); // disable all pool features if the thread pool is disabled or unusable. - p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = 0; + p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = p->lookaheadSlices = 0; } char buf[128]; @@ -159,7 +163,10 @@ x265_log(p, X265_LOG_INFO, "frame threads / pool features : %d / %s\n", p->frameNumThreads, buf); for (int i = 0; i < m_param->frameNumThreads; i++) + { m_frameEncoder[i] = new FrameEncoder; + m_frameEncoder[i]->m_nalList.m_annexB = !!m_param->bAnnexB; + } if (m_numPools) { @@ -287,15 +294,17 @@ m_aborted |= parseLambdaFile(m_param); m_encodeStartTime = x265_mdate(); + + m_nalList.m_annexB = !!m_param->bAnnexB; } -void Encoder::stop() +void Encoder::stopJobs() { if (m_rateControl) m_rateControl->terminate(); // unblock all blocked RC calls if (m_lookahead) - m_lookahead->stop(); + m_lookahead->stopJobs(); for (int i = 0; i < m_param->frameNumThreads; i++) { @@ -309,7 +318,7 @@ } if (m_threadPool) - m_threadPool->stop(); + m_threadPool->stopWorkers(); } void Encoder::destroy() @@ -358,15 +367,20 @@ if (m_param) { - free((void*)m_param->rc.lambdaFileName); // allocs by strdup - free(m_param->rc.statFileName); - free(m_param->analysisFileName); - free((void*)m_param->scalingLists); - free(m_param->csvfn); - free(m_param->numaPools); + /* release string arguments that were strdup'd */ + free((char*)m_param->rc.lambdaFileName); + free((char*)m_param->rc.statFileName); + free((char*)m_param->analysisFileName); + free((char*)m_param->scalingLists); + free((char*)m_param->csvfn); + free((char*)m_param->numaPools); + free((char*)m_param->masteringDisplayColorVolume); + free((char*)m_param->contentLightLevelInfo); - X265_FREE(m_param); + x265_param_free(m_param); } + + x265_param_free(m_latestParam); } void Encoder::updateVbvPlan(RateControl* rc) @@ -436,7 +450,8 @@ if (m_dpb->m_freeList.empty()) { inFrame = new Frame; - if (inFrame->create(m_param)) + x265_param* p = m_reconfigured? 
m_latestParam : m_param; + if (inFrame->create(p)) { /* the first PicYuv created is asked to generate the CU and block unit offset * arrays which are then shared with all subsequent PicYuv (orig and recon) @@ -477,7 +492,10 @@ } } else + { inFrame = m_dpb->m_freeList.popBack(); + inFrame->m_lowresInit = false; + } /* Copy input picture into a Frame and PicYuv, send to lookahead */ inFrame->m_fencPic->copyFromPicture(*pic_in, m_sps.conformanceWindow.rightOffset, m_sps.conformanceWindow.bottomOffset); @@ -486,6 +504,7 @@ inFrame->m_userData = pic_in->userData; inFrame->m_pts = pic_in->pts; inFrame->m_forceqp = pic_in->forceqp; + inFrame->m_param = m_reconfigured ? m_latestParam : m_param; if (m_pocLast == 0) m_firstPts = inFrame->m_pts; @@ -717,6 +736,34 @@ return ret; } +int Encoder::reconfigureParam(x265_param* encParam, x265_param* param) +{ + encParam->maxNumReferences = param->maxNumReferences; // never uses more refs than specified in stream headers + encParam->bEnableLoopFilter = param->bEnableLoopFilter; + encParam->deblockingFilterTCOffset = param->deblockingFilterTCOffset; + encParam->deblockingFilterBetaOffset = param->deblockingFilterBetaOffset; + encParam->bEnableFastIntra = param->bEnableFastIntra; + encParam->bEnableEarlySkip = param->bEnableEarlySkip; + encParam->bEnableTemporalMvp = param->bEnableTemporalMvp; + /* Scratch buffer prevents me_range from being increased for esa/tesa + if (param->searchMethod < X265_FULL_SEARCH || param->searchMethod < encParam->searchRange) + encParam->searchRange = param->searchRange; */ + encParam->noiseReductionInter = param->noiseReductionInter; + encParam->noiseReductionIntra = param->noiseReductionIntra; + /* We can't switch out of subme=0 during encoding. */ + if (encParam->subpelRefine) + encParam->subpelRefine = param->subpelRefine; + encParam->rdoqLevel = param->rdoqLevel; + encParam->rdLevel = param->rdLevel; + encParam->bEnableTSkipFast = param->bEnableTSkipFast; + encParam->psyRd = param->psyRd; + encParam->psyRdoq = param->psyRdoq; + encParam->bEnableSignHiding = param->bEnableSignHiding; + encParam->bEnableFastIntra = param->bEnableFastIntra; + encParam->maxTUSize = param->maxTUSize; + return x265_check_params(encParam); +} + void EncStats::addPsnr(double psnrY, double psnrU, double psnrV) { m_psnrSumY += psnrY; @@ -1430,6 +1477,34 @@ bs.writeByteAlignment(); list.serialize(NAL_UNIT_PPS, bs); + if (m_param->masteringDisplayColorVolume) + { + SEIMasteringDisplayColorVolume mdsei; + if (mdsei.parse(m_param->masteringDisplayColorVolume)) + { + bs.resetBits(); + mdsei.write(bs, m_sps); + bs.writeByteAlignment(); + list.serialize(NAL_UNIT_PREFIX_SEI, bs); + } + else + x265_log(m_param, X265_LOG_WARNING, "unable to parse mastering display color volume info\n"); + } + + if (m_param->contentLightLevelInfo) + { + SEIContentLightLevel cllsei; + if (cllsei.parse(m_param->contentLightLevelInfo)) + { + bs.resetBits(); + cllsei.write(bs, m_sps); + bs.writeByteAlignment(); + list.serialize(NAL_UNIT_PREFIX_SEI, bs); + } + else + x265_log(m_param, X265_LOG_WARNING, "unable to parse content light level info\n"); + } + if (m_param->bEmitInfoSEI) { char *opts = x265_param2string(m_param); @@ -1559,7 +1634,8 @@ if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv)) { pps->bUseDQP = true; - pps->maxCuDQPDepth = 0; /* TODO: make configurable? 
*/ + pps->maxCuDQPDepth = g_log2Size[m_param->maxCUSize] - g_log2Size[m_param->rc.qgSize]; + X265_CHECK(pps->maxCuDQPDepth <= 2, "max CU DQP depth cannot be greater than 2\n"); } else { @@ -1788,6 +1864,23 @@ p->analysisMode = X265_ANALYSIS_OFF; x265_log(p, X265_LOG_WARNING, "Analysis save and load mode not supported for distributed mode analysis\n"); } + + bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; + if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv)) + { + if (p->rc.qgSize < X265_MAX(16, p->minCUSize)) + { + p->rc.qgSize = X265_MAX(16, p->minCUSize); + x265_log(p, X265_LOG_WARNING, "QGSize should be greater than or equal to 16 and minCUSize, setting QGSize = %d\n", p->rc.qgSize); + } + if (p->rc.qgSize > p->maxCUSize) + { + p->rc.qgSize = p->maxCUSize; + x265_log(p, X265_LOG_WARNING, "QGSize should be less than or equal to maxCUSize, setting QGSize = %d\n", p->rc.qgSize); + } + } + else + m_param->rc.qgSize = p->maxCUSize; } void Encoder::allocAnalysis(x265_analysis_data* analysis)
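Since reconfigureParam() above is the whitelist that x265_encoder_reconfig() ultimately applies, a hedged sketch of the calling pattern (enc is assumed to be an already-open x265_encoder*; error handling trimmed):

    x265_param param;
    x265_encoder_parameters(enc, &param);   /* snapshot the active settings */
    param.psyRd = 1.5;                      /* psyRd is on the copy list above */
    param.noiseReductionInter = 400;        /* so is inter noise reduction */
    if (x265_encoder_reconfig(enc, &param) < 0)
        fprintf(stderr, "reconfig rejected by x265_check_params()\n");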
View file
x265_1.6.tar.gz/source/encoder/encoder.h -> x265_1.7.tar.gz/source/encoder/encoder.h
Changed
@@ -125,22 +125,26 @@ uint32_t m_numDelayedPic; x265_param* m_param; + x265_param* m_latestParam; RateControl* m_rateControl; Lookahead* m_lookahead; Window m_conformanceWindow; bool m_bZeroLatency; // x265_encoder_encode() returns NALs for the input picture, zero lag bool m_aborted; // fatal error detected + bool m_reconfigured; // reconfigure of encoder detected Encoder(); ~Encoder() {} void create(); - void stop(); + void stopJobs(); void destroy(); int encode(const x265_picture* pic, x265_picture *pic_out); + int reconfigureParam(x265_param* encParam, x265_param* param); + void getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs); void fetchStats(x265_stats* stats, size_t statsSizeBytes);
View file
x265_1.6.tar.gz/source/encoder/entropy.cpp -> x265_1.7.tar.gz/source/encoder/entropy.cpp
Changed
@@ -585,7 +585,7 @@ if (ctu.isSkipped(absPartIdx)) { codeMergeIndex(ctu, absPartIdx); - finishCU(ctu, absPartIdx, depth); + finishCU(ctu, absPartIdx, depth, bEncodeDQP); return; } codePredMode(ctu.m_predMode[absPartIdx]); @@ -606,7 +606,7 @@ codeCoeff(ctu, absPartIdx, bEncodeDQP, tuDepthRange); // --- write terminating bit --- - finishCU(ctu, absPartIdx, depth); + finishCU(ctu, absPartIdx, depth, bEncodeDQP); } /* Return bit count of signaling inter mode */ @@ -658,7 +658,7 @@ } /* finish encoding a cu and handle end-of-slice conditions */ -void Entropy::finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth) +void Entropy::finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth, bool bCodeDQP) { const Slice* slice = ctu.m_slice; uint32_t realEndAddress = slice->m_endCUAddr; @@ -672,6 +672,9 @@ bool granularityBoundary = (((rpelx & granularityMask) == 0 || (rpelx == slice->m_sps->picWidthInLumaSamples )) && ((bpely & granularityMask) == 0 || (bpely == slice->m_sps->picHeightInLumaSamples))); + if (slice->m_pps->bUseDQP) + const_cast<CUData&>(ctu).setQPSubParts(bCodeDQP ? ctu.getRefQP(absPartIdx) : ctu.m_qp[absPartIdx], absPartIdx, depth); + if (granularityBoundary) { // Encode slice finish @@ -1141,11 +1144,11 @@ { length = 0; codeNumber = (codeNumber >> absGoRice) - COEF_REMAIN_BIN_REDUCTION; - if (codeNumber != 0) { unsigned long idx; CLZ(idx, codeNumber + 1); length = idx; + X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n"); codeNumber -= (1 << idx) - 1; } codeNumber = (codeNumber << absGoRice) + codeRemain; @@ -1461,7 +1464,7 @@ //const uint32_t maskPosXY = ((uint32_t)~0 >> (31 - log2TrSize + MLS_CG_LOG2_SIZE)) >> 1; X265_CHECK((uint32_t)((1 << (log2TrSize - MLS_CG_LOG2_SIZE)) - 1) == (((uint32_t)~0 >> (31 - log2TrSize + MLS_CG_LOG2_SIZE)) >> 1), "maskPosXY fault\n"); - scanPosLast = primitives.findPosLast(codingParameters.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig); + scanPosLast = primitives.scanPosLast(codingParameters.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codingParameters.scanType], trSize); posLast = codingParameters.scan[scanPosLast]; const int lastScanSet = scanPosLast >> MLS_CG_SIZE; @@ -1515,7 +1518,6 @@ uint8_t * const baseCoeffGroupCtx = &m_contextState[OFF_SIG_CG_FLAG_CTX + (bIsLuma ? 0 : NUM_SIG_CG_FLAG_CTX)]; uint8_t * const baseCtx = bIsLuma ? 
&m_contextState[OFF_SIG_FLAG_CTX] : &m_contextState[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA]; uint32_t c1 = 1; - uint32_t goRiceParam = 0; int scanPosSigOff = scanPosLast - (lastScanSet << MLS_CG_SIZE) - 1; int absCoeff[1 << MLS_CG_SIZE]; int numNonZero = 1; @@ -1529,7 +1531,6 @@ const uint32_t subCoeffFlag = coeffFlag[subSet]; uint32_t scanFlagMask = subCoeffFlag; int subPosBase = subSet << MLS_CG_SIZE; - goRiceParam = 0; if (subSet == lastScanSet) { @@ -1548,7 +1549,7 @@ else { uint32_t sigCoeffGroup = ((sigCoeffGroupFlag64 & cgBlkPosMask) != 0); - uint32_t ctxSig = Quant::getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codingParameters.log2TrSizeCG); + uint32_t ctxSig = Quant::getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE)); encodeBin(sigCoeffGroup, baseCoeffGroupCtx[ctxSig]); } @@ -1556,7 +1557,8 @@ if (sigCoeffGroupFlag64 & cgBlkPosMask) { X265_CHECK((log2TrSize != 2) || (log2TrSize == 2 && subSet == 0), "log2TrSize and subSet mistake!\n"); - const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, codingParameters.log2TrSizeCG); + const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE)); + const uint32_t posOffset = (bIsLuma && subSet) ? 3 : 0; static const uint8_t ctxIndMap4x4[16] = { @@ -1566,37 +1568,50 @@ 7, 7, 8, 8 }; // NOTE: [patternSigCtx][posXinSubset][posYinSubset] - static const uint8_t table_cnt[4][4][4] = + static const uint8_t table_cnt[4][SCAN_SET_SIZE] = { // patternSigCtx = 0 { - { 2, 1, 1, 0 }, - { 1, 1, 0, 0 }, - { 1, 0, 0, 0 }, - { 0, 0, 0, 0 }, + 2, 1, 1, 0, + 1, 1, 0, 0, + 1, 0, 0, 0, + 0, 0, 0, 0, }, // patternSigCtx = 1 { - { 2, 1, 0, 0 }, - { 2, 1, 0, 0 }, - { 2, 1, 0, 0 }, - { 2, 1, 0, 0 }, + 2, 2, 2, 2, + 1, 1, 1, 1, + 0, 0, 0, 0, + 0, 0, 0, 0, }, // patternSigCtx = 2 { - { 2, 2, 2, 2 }, - { 1, 1, 1, 1 }, - { 0, 0, 0, 0 }, - { 0, 0, 0, 0 }, + 2, 1, 0, 0, + 2, 1, 0, 0, + 2, 1, 0, 0, + 2, 1, 0, 0, }, // patternSigCtx = 3 { - { 2, 2, 2, 2 }, - { 2, 2, 2, 2 }, - { 2, 2, 2, 2 }, - { 2, 2, 2, 2 }, + 2, 2, 2, 2, + 2, 2, 2, 2, + 2, 2, 2, 2, + 2, 2, 2, 2, } }; + + const int offset = codingParameters.firstSignificanceMapContext; + ALIGN_VAR_32(uint16_t, tmpCoeff[SCAN_SET_SIZE]); + // TODO: accelerate by PABSW + const uint32_t blkPosBase = codingParameters.scan[subPosBase]; + for (int i = 0; i < MLS_CG_SIZE; i++) + { + tmpCoeff[i * MLS_CG_SIZE + 0] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 0]); + tmpCoeff[i * MLS_CG_SIZE + 1] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 1]); + tmpCoeff[i * MLS_CG_SIZE + 2] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 2]); + tmpCoeff[i * MLS_CG_SIZE + 3] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 3]); + } + if (m_bitIf) { if (log2TrSize == 2) @@ -1604,16 +1619,16 @@ uint32_t blkPos, sig, ctxSig; for (; scanPosSigOff >= 0; scanPosSigOff--) { - blkPos = codingParameters.scan[subPosBase + scanPosSigOff]; + blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; sig = scanFlagMask & 1; scanFlagMask >>= 1; - X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n"); + X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n"); { ctxSig = ctxIndMap4x4[blkPos]; X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; encodeBin(sig, baseCtx[ctxSig]); } - absCoeff[numNonZero] = int(abs(coeff[blkPos])); + absCoeff[numNonZero] 
= tmpCoeff[blkPos]; numNonZero += sig; } } @@ -1621,35 +1636,25 @@ { X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n"); - const uint8_t (*tabSigCtx)[4] = table_cnt[(uint32_t)patternSigCtx]; - const int offset = codingParameters.firstSignificanceMapContext; - const uint32_t lumaMask = bIsLuma ? ~0 : 0; - static const uint32_t posXY4Mask[] = {0x024, 0x0CC, 0x39C}; - const uint32_t posGT4Mask = posXY4Mask[log2TrSize - 3] & lumaMask; + const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx]; uint32_t blkPos, sig, ctxSig; for (; scanPosSigOff >= 0; scanPosSigOff--) { - blkPos = codingParameters.scan[subPosBase + scanPosSigOff]; - X265_CHECK(blkPos || (subPosBase + scanPosSigOff == 0), "blkPos==0 must be at scan[0]\n"); + blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0; sig = scanFlagMask & 1; scanFlagMask >>= 1; - X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n"); + X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n"); if (scanPosSigOff != 0 || subSet == 0 || numNonZero) { - const uint32_t posY = blkPos >> log2TrSize; - const uint32_t posOffset = (blkPos & posGT4Mask) ? 3 : 0; - - const uint32_t posXinSubset = blkPos & 3; - const uint32_t posYinSubset = posY & 3; - const uint32_t cnt = tabSigCtx[posXinSubset][posYinSubset] + offset; + const uint32_t cnt = tabSigCtx[blkPos] + offset; ctxSig = (cnt + posOffset) & posZeroMask; - X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; + X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; encodeBin(sig, baseCtx[ctxSig]); } - absCoeff[numNonZero] = int(abs(coeff[blkPos])); + absCoeff[numNonZero] = tmpCoeff[blkPos]; numNonZero += sig; } } @@ -1663,19 +1668,26 @@ uint32_t blkPos, sig, ctxSig; for (; scanPosSigOff >= 0; scanPosSigOff--) { - blkPos = codingParameters.scan[subPosBase + scanPosSigOff]; + blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; sig = scanFlagMask & 1; scanFlagMask >>= 1; - X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n"); + X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n"); { ctxSig = ctxIndMap4x4[blkPos]; - X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; + X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; //encodeBin(sig, baseCtx[ctxSig]); const uint32_t mstate = baseCtx[ctxSig]; - baseCtx[ctxSig] = sbacNext(mstate, sig); - sum += sbacGetEntropyBits(mstate, sig); + const uint32_t mps = mstate & 1; + const uint32_t stateBits = g_entropyStateBits[mstate ^ sig]; + uint32_t nextState = (stateBits >> 23) + mps; + if ((mstate ^ sig) == 1) + nextState = sig; + X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n"); + X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n"); + baseCtx[ctxSig] = (uint8_t)nextState; + sum += stateBits; } - absCoeff[numNonZero] = int(abs(coeff[blkPos])); + absCoeff[numNonZero] = tmpCoeff[blkPos]; 
numNonZero += sig; } } // end of 4x4 @@ -1683,41 +1695,39 @@ { X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n"); - const uint8_t (*tabSigCtx)[4] = table_cnt[(uint32_t)patternSigCtx]; - const int offset = codingParameters.firstSignificanceMapContext; - const uint32_t lumaMask = bIsLuma ? ~0 : 0; - static const uint32_t posXY4Mask[] = {0x024, 0x0CC, 0x39C}; - const uint32_t posGT4Mask = posXY4Mask[log2TrSize - 3] & lumaMask; + const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx]; uint32_t blkPos, sig, ctxSig; for (; scanPosSigOff >= 0; scanPosSigOff--) { - blkPos = codingParameters.scan[subPosBase + scanPosSigOff]; - X265_CHECK(blkPos || (subPosBase + scanPosSigOff == 0), "blkPos==0 must be at scan[0]\n"); + blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0; sig = scanFlagMask & 1; scanFlagMask >>= 1; - X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n"); + X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n"); if (scanPosSigOff != 0 || subSet == 0 || numNonZero) { - const uint32_t posY = blkPos >> log2TrSize; - const uint32_t posOffset = (blkPos & posGT4Mask) ? 3 : 0; - - const uint32_t posXinSubset = blkPos & 3; - const uint32_t posYinSubset = posY & 3; - const uint32_t cnt = tabSigCtx[posXinSubset][posYinSubset] + offset; + const uint32_t cnt = tabSigCtx[blkPos] + offset; ctxSig = (cnt + posOffset) & posZeroMask; - X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; + X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; //encodeBin(sig, baseCtx[ctxSig]); const uint32_t mstate = baseCtx[ctxSig]; - baseCtx[ctxSig] = sbacNext(mstate, sig); - sum += sbacGetEntropyBits(mstate, sig); + const uint32_t mps = mstate & 1; + const uint32_t stateBits = g_entropyStateBits[mstate ^ sig]; + uint32_t nextState = (stateBits >> 23) + mps; + if ((mstate ^ sig) == 1) + nextState = sig; + X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n"); + X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n"); + baseCtx[ctxSig] = (uint8_t)nextState; + sum += stateBits; } - absCoeff[numNonZero] = int(abs(coeff[blkPos])); + absCoeff[numNonZero] = tmpCoeff[blkPos]; numNonZero += sig; } } // end of non 4x4 path + sum &= 0xFFFFFF; // update RD cost m_fracBits += sum; @@ -1762,31 +1772,77 @@ if (!c1) { baseCtxMod = bIsLuma ? &m_contextState[OFF_ABS_FLAG_CTX + ctxSet] : &m_contextState[OFF_ABS_FLAG_CTX + NUM_ABS_FLAG_CTX_LUMA + ctxSet]; - if (firstC2FlagIdx != -1) - { - uint32_t symbol = absCoeff[firstC2FlagIdx] > 2; - encodeBin(symbol, baseCtxMod[0]); - } + + X265_CHECK((firstC2FlagIdx != -1), "firstC2FlagIdx check failure\n"); + uint32_t symbol = absCoeff[firstC2FlagIdx] > 2; + encodeBin(symbol, baseCtxMod[0]); } const int hiddenShift = (bHideFirstSign && signHidden) ? 1 : 0; encodeBinsEP((coeffSigns >> hiddenShift), numNonZero - hiddenShift); - int firstCoeff2 = 1; if (!c1 || numNonZero > C1FLAG_NUMBER) { - for (int idx = 0; idx < numNonZero; idx++) + uint32_t goRiceParam = 0; + int firstCoeff2 = 1; + uint32_t baseLevelN = 0x5555AAAA; // 2-bits encode format baseLevel + + if (!m_bitIf) { - int baseLevel = (idx < C1FLAG_NUMBER) ? 
(2 + firstCoeff2) : 1; + // FastRd path + for (int idx = 0; idx < numNonZero; idx++) + { + int baseLevel = (baseLevelN & 3) | firstCoeff2; + X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failure\n"); + baseLevelN >>= 2; + int codeNumber = absCoeff[idx] - baseLevel; - if (absCoeff[idx] >= baseLevel) + if (codeNumber >= 0) + { + //writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam); + uint32_t length = 0; + + codeNumber = ((uint32_t)codeNumber >> goRiceParam) - COEF_REMAIN_BIN_REDUCTION; + if (codeNumber >= 0) + { + { + unsigned long cidx; + CLZ(cidx, codeNumber + 1); + length = cidx; + } + X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n"); + + codeNumber = (length + length); + } + m_fracBits += (COEF_REMAIN_BIN_REDUCTION + 1 + goRiceParam + codeNumber) << 15; + + if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam)) + goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2); + X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n"); + } + if (absCoeff[idx] >= 2) + firstCoeff2 = 0; + } + } + else + { + // Standard path + for (int idx = 0; idx < numNonZero; idx++) { - int baseLevel = (idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1; + int baseLevel = (baseLevelN & 3) | firstCoeff2; + X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failure\n"); + baseLevelN >>= 2; + + if (absCoeff[idx] >= baseLevel) + { + writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam); + if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam)) + goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2); + X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n"); + } + if (absCoeff[idx] >= 2) + firstCoeff2 = 0; } - if (absCoeff[idx] >= 2) - firstCoeff2 = 0; } } } @@ -1874,20 +1930,20 @@ if (bIsLuma) { for (uint32_t bin = 0; bin < 2; bin++) - estBitsSbac.significantBits[0][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX], bin); + estBitsSbac.significantBits[bin][0] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX], bin); for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++) for (uint32_t bin = 0; bin < 2; bin++) - estBitsSbac.significantBits[ctxIdx][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + ctxIdx], bin); + estBitsSbac.significantBits[bin][ctxIdx] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + ctxIdx], bin); } else { for (uint32_t bin = 0; bin < 2; bin++) - estBitsSbac.significantBits[0][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + 0)], bin); + estBitsSbac.significantBits[bin][0] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + 0)], bin); for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++) for (uint32_t bin = 0; bin < 2; bin++) - estBitsSbac.significantBits[ctxIdx][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + ctxIdx)], bin); + estBitsSbac.significantBits[bin][ctxIdx] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + ctxIdx)], bin); } int blkSizeOffset = bIsLuma ? 
((log2TrSize - 2) * 3 + ((log2TrSize - 1) >> 2)) : NUM_CTX_LAST_FLAG_XY_LUMA; @@ -2187,6 +2243,28 @@ 0x0050c, 0x29bab, 0x004c1, 0x2a674, 0x004a7, 0x2aa5e, 0x0046f, 0x2b32f, 0x0041f, 0x2c0ad, 0x003e7, 0x2ca8d, 0x003ba, 0x2d323, 0x0010c, 0x3bfbb }; +// [8 24] --> [stateMPS BitCost], [stateLPS BitCost] +const uint32_t g_entropyStateBits[128] = +{ + // Corrected table, most notably for last state + 0x01007b23, 0x000085f9, 0x020074a0, 0x00008cbc, 0x03006ee4, 0x01009354, 0x040067f4, 0x02009c1b, + 0x050060b0, 0x0200a62a, 0x06005a9c, 0x0400af5b, 0x0700548d, 0x0400b955, 0x08004f56, 0x0500c2a9, + 0x09004a87, 0x0600cbf7, 0x0a0045d6, 0x0700d5c3, 0x0b004144, 0x0800e01b, 0x0c003d88, 0x0900e937, + 0x0d0039e0, 0x0900f2cd, 0x0e003663, 0x0b00fc9e, 0x0f003347, 0x0b010600, 0x10003050, 0x0c010f95, + 0x11002d4d, 0x0d011a02, 0x12002ad3, 0x0d012333, 0x1300286e, 0x0f012cad, 0x14002604, 0x0f0136df, + 0x15002425, 0x10013f48, 0x160021f4, 0x100149c4, 0x1700203e, 0x1201527b, 0x18001e4d, 0x12015d00, + 0x19001c99, 0x130166de, 0x1a001b18, 0x13017017, 0x1b0019a5, 0x15017988, 0x1c001841, 0x15018327, + 0x1d0016df, 0x16018d50, 0x1e0015d9, 0x16019547, 0x1f00147c, 0x1701a083, 0x2000138e, 0x1801a8a3, + 0x21001251, 0x1801b418, 0x22001166, 0x1901bd27, 0x23001068, 0x1a01c77b, 0x24000f7f, 0x1a01d18e, + 0x25000eda, 0x1b01d91a, 0x26000e19, 0x1b01e254, 0x27000d4f, 0x1c01ec9a, 0x28000c90, 0x1d01f6e0, + 0x29000c01, 0x1d01fef8, 0x2a000b5f, 0x1e0208b1, 0x2b000ab6, 0x1e021362, 0x2c000a15, 0x1e021e46, + 0x2d000988, 0x1f02285d, 0x2e000934, 0x20022ea8, 0x2f0008a8, 0x200239b2, 0x3000081d, 0x21024577, + 0x310007c9, 0x21024ce6, 0x32000763, 0x21025663, 0x33000710, 0x22025e8f, 0x340006a0, 0x22026a26, + 0x35000672, 0x23026f23, 0x360005e8, 0x23027ef8, 0x370005ba, 0x230284b5, 0x3800055e, 0x24029057, + 0x3900050c, 0x24029bab, 0x3a0004c1, 0x2402a674, 0x3b0004a7, 0x2502aa5e, 0x3c00046f, 0x2502b32f, + 0x3d00041f, 0x2502c0ad, 0x3e0003e7, 0x2602ca8d, 0x3e0003ba, 0x2602d323, 0x3f00010c, 0x3f03bfbb, +}; + const uint8_t g_nextState[128][2] = { { 2, 1 }, { 0, 3 }, { 4, 0 }, { 1, 5 }, { 6, 2 }, { 3, 7 }, { 8, 4 }, { 5, 9 },
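The packed g_entropyStateBits[] table above replaces the separate sbacNext()/sbacGetEntropyBits() lookups in the RD estimation path. A hedged restatement of how one entry is consumed, mirroring the inline code in the diff (helper name illustrative):

    /* per the [8 24] comment: high bits seed the next CABAC state,
     * low 24 bits hold the bin cost in 1/32768-bit units */
    static inline void rdUpdateContext(uint8_t* ctx, uint32_t bin, uint64_t* fracBits)
    {
        const uint32_t mstate    = *ctx;
        const uint32_t mps       = mstate & 1;            /* LSB stores the MPS */
        const uint32_t stateBits = g_entropyStateBits[mstate ^ bin];

        uint32_t nextState = (stateBits >> 23) + mps;
        if ((mstate ^ bin) == 1)     /* LPS at the lowest state: the MPS flips */
            nextState = bin;

        *ctx = (uint8_t)nextState;
        *fracBits += stateBits & 0xFFFFFF;
    }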
View file
x265_1.6.tar.gz/source/encoder/entropy.h -> x265_1.7.tar.gz/source/encoder/entropy.h
Changed
@@ -87,7 +87,7 @@ struct EstBitsSbac { int significantCoeffGroupBits[NUM_SIG_CG_FLAG_CTX][2]; - int significantBits[NUM_SIG_FLAG_CTX][2]; + int significantBits[2][NUM_SIG_FLAG_CTX]; int lastBits[2][10]; int greaterOneBits[NUM_ONE_FLAG_CTX][2]; int levelAbsBits[NUM_ABS_FLAG_CTX][2]; @@ -179,7 +179,7 @@ inline void codeQtCbfChroma(uint32_t cbf, uint32_t tuDepth) { encodeBin(cbf, m_contextState[OFF_QT_CBF_CTX + 2 + tuDepth]); } inline void codeQtRootCbf(uint32_t cbf) { encodeBin(cbf, m_contextState[OFF_QT_ROOT_CBF_CTX]); } inline void codeTransformSkipFlags(uint32_t transformSkip, TextType ttype) { encodeBin(transformSkip, m_contextState[OFF_TRANSFORMSKIP_FLAG_CTX + (ttype ? NUM_TRANSFORMSKIP_FLAG_CTX : 0)]); } - + void codeDeltaQP(const CUData& cu, uint32_t absPartIdx); void codeSaoOffset(const SaoCtuParam& ctuParam, int plane); /* RDO functions */ @@ -221,7 +221,7 @@ } void encodeCU(const CUData& ctu, const CUGeom &cuGeom, uint32_t absPartIdx, uint32_t depth, bool& bEncodeDQP); - void finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth); + void finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth, bool bEncodeDQP); void writeOut(); @@ -242,7 +242,6 @@ void codeSaoMaxUvlc(uint32_t code, uint32_t maxSymbol); - void codeDeltaQP(const CUData& cu, uint32_t absPartIdx); void codeLastSignificantXY(uint32_t posx, uint32_t posy, uint32_t log2TrSize, bool bIsLuma, uint32_t scanIdx); void encodeTransform(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, uint32_t log2TrSize,
View file
x265_1.6.tar.gz/source/encoder/frameencoder.cpp -> x265_1.7.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -213,6 +213,7 @@ { m_slicetypeWaitTime = x265_mdate() - m_prevOutputTime; m_frame = curFrame; + m_param = curFrame->m_param; m_sliceType = curFrame->m_lowres.sliceType; curFrame->m_encData->m_frameEncoderID = m_jpId; curFrame->m_encData->m_jobProvider = this; @@ -794,6 +795,7 @@ uint32_t row = (uint32_t)intRow; CTURow& curRow = m_rows[row]; + tld.analysis.m_param = m_param; if (m_param->bEnableWavefront) { ScopedLock self(curRow.lock); @@ -824,6 +826,13 @@ const uint32_t lineStartCUAddr = row * numCols; bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; + /* These store the count of inter, intra and skip cus within quad tree structure of each CTU */ + uint32_t qTreeInterCnt[NUM_CU_DEPTH]; + uint32_t qTreeIntraCnt[NUM_CU_DEPTH]; + uint32_t qTreeSkipCnt[NUM_CU_DEPTH]; + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0; + while (curRow.completed < numCols) { ProfileScopeEvent(encodeCTU); @@ -841,24 +850,34 @@ curEncData.m_rowStat[row].diagQpScale = x265_qp2qScale(curEncData.m_avgQpRc); } + FrameData::RCStatCU& cuStat = curEncData.m_cuStat[cuAddr]; if (row >= col && row && m_vbvResetTriggerRow != intRow) - curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp; + cuStat.baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp; else - curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_rowStat[row].diagQp; - } - else - curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_avgQpRc; + cuStat.baseQp = curEncData.m_rowStat[row].diagQp; + + /* TODO: use defines from slicetype.h for lowres block size */ + uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16; + uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16; + uint32_t noOfBlocks = g_maxCUSize / 16; + uint32_t block_y = (cuAddr / curEncData.m_slice->m_sps->numCuInWidth) * noOfBlocks; + uint32_t block_x = (cuAddr * noOfBlocks) - block_y * curEncData.m_slice->m_sps->numCuInWidth; + + cuStat.vbvCost = 0; + cuStat.intraVbvCost = 0; + for (uint32_t h = 0; h < noOfBlocks && block_y < maxBlockRows; h++, block_y++) + { + uint32_t idx = block_x + (block_y * maxBlockCols); - if (m_param->rc.aqMode || bIsVbv) - { - int qp = calcQpForCu(cuAddr, curEncData.m_cuStat[cuAddr].baseQp); - tld.analysis.setQP(*slice, qp); - qp = x265_clip3(QP_MIN, QP_MAX_SPEC, qp); - ctu->setQPSubParts((int8_t)qp, 0, 0); - curEncData.m_rowStat[row].sumQpAq += qp; + for (uint32_t w = 0; w < noOfBlocks && (block_x + w) < maxBlockCols; w++, idx++) + { + cuStat.vbvCost += m_frame->m_lowres.lowresCostForRc[idx] & LOWRES_COST_MASK; + cuStat.intraVbvCost += m_frame->m_lowres.intraCost[idx]; + } + } } else - tld.analysis.setQP(*slice, slice->m_sliceQp); + curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_avgQpRc; if (m_param->bEnableWavefront && !col && row) { @@ -886,7 +905,9 @@ curRow.completed++; if (m_param->bLogCuStats || m_param->rc.bStatWrite) - collectCTUStatistics(*ctu); + curEncData.m_rowStat[row].sumQpAq += collectCTUStatistics(*ctu, qTreeInterCnt, qTreeIntraCnt, qTreeSkipCnt); + else if (m_param->rc.aqMode) + curEncData.m_rowStat[row].sumQpAq += calcCTUQP(*ctu); // copy no. 
of intra, inter Cu cnt per row into frame stats for 2 pass if (m_param->rc.bStatWrite) @@ -894,18 +915,17 @@ curRow.rowStats.mvBits += best.mvBits; curRow.rowStats.coeffBits += best.coeffBits; curRow.rowStats.miscBits += best.totalBits - (best.mvBits + best.coeffBits); - StatisticLog* log = &m_sliceTypeLog[slice->m_sliceType]; for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) { /* 1 << shift == number of 8x8 blocks at current depth */ int shift = 2 * (g_maxCUDepth - depth); - curRow.rowStats.iCuCnt += log->qTreeIntraCnt[depth] << shift; - curRow.rowStats.pCuCnt += log->qTreeInterCnt[depth] << shift; - curRow.rowStats.skipCuCnt += log->qTreeSkipCnt[depth] << shift; + curRow.rowStats.iCuCnt += qTreeIntraCnt[depth] << shift; + curRow.rowStats.pCuCnt += qTreeInterCnt[depth] << shift; + curRow.rowStats.skipCuCnt += qTreeSkipCnt[depth] << shift; // clear the row cu data from thread local object - log->qTreeIntraCnt[depth] = log->qTreeInterCnt[depth] = log->qTreeSkipCnt[depth] = 0; + qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0; } } @@ -1075,15 +1095,18 @@ } } + tld.analysis.m_param = NULL; curRow.busy = false; if (ATOMIC_INC(&m_completionCount) == 2 * (int)m_numRows) m_completionEvent.trigger(); } -void FrameEncoder::collectCTUStatistics(CUData& ctu) +/* collect statistics about CU coding decisions, return total QP */ +int FrameEncoder::collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt) { StatisticLog* log = &m_sliceTypeLog[ctu.m_slice->m_sliceType]; + int totQP = 0; if (ctu.m_slice->m_sliceType == I_SLICE) { @@ -1094,13 +1117,14 @@ log->totalCu++; log->cntIntra[depth]++; - log->qTreeIntraCnt[depth]++; + qtreeIntraCnt[depth]++; + totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2)); if (ctu.m_predMode[absPartIdx] == MODE_NONE) { log->totalCu--; log->cntIntra[depth]--; - log->qTreeIntraCnt[depth]--; + qtreeIntraCnt[depth]--; } else if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N) { @@ -1124,6 +1148,7 @@ log->totalCu++; log->cntTotalCu[depth]++; + totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2)); if (ctu.m_predMode[absPartIdx] == MODE_NONE) { @@ -1134,12 +1159,12 @@ { log->totalCu--; log->cntSkipCu[depth]++; - log->qTreeSkipCnt[depth]++; + qtreeSkipCnt[depth]++; } else if (ctu.isInter(absPartIdx)) { log->cntInter[depth]++; - log->qTreeInterCnt[depth]++; + qtreeInterCnt[depth]++; if (ctu.m_partSize[absPartIdx] < AMP_ID) log->cuInterDistribution[depth][ctu.m_partSize[absPartIdx]]++; @@ -1149,12 +1174,13 @@ else if (ctu.isIntra(absPartIdx)) { log->cntIntra[depth]++; - log->qTreeIntraCnt[depth]++; + qtreeIntraCnt[depth]++; if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N) { X265_CHECK(ctu.m_log2CUSize[absPartIdx] == 3 && ctu.m_slice->m_sps->quadtreeTULog2MinSize < 3, "Intra NxN found at improbable depth\n"); log->cntIntraNxN++; + log->cntIntra[depth]--; /* TODO: log intra modes at absPartIdx +0 to +3 */ } else if (ctu.m_lumaIntraDir[absPartIdx] > 1) @@ -1164,6 +1190,23 @@ } } } + + return totQP; +} + +/* iterate over coded CUs and determine total QP */ +int FrameEncoder::calcCTUQP(const CUData& ctu) +{ + int totQP = 0; + uint32_t depth = 0, numParts = ctu.m_numPartitions; + + for (uint32_t absPartIdx = 0; absPartIdx < ctu.m_numPartitions; absPartIdx += numParts) + { + depth = ctu.m_cuDepth[absPartIdx]; + numParts = ctu.m_numPartitions >> (depth * 2); + totQP += ctu.m_qp[absPartIdx] * numParts; + } + return totQP; } /* DCT-domain noise reduction / adaptive deadzone from libavcodec 
*/ @@ -1198,55 +1241,6 @@ } } -int FrameEncoder::calcQpForCu(uint32_t ctuAddr, double baseQp) -{ - x265_emms(); - double qp = baseQp; - - FrameData& curEncData = *m_frame->m_encData; - /* clear cuCostsForVbv from when vbv row reset was triggered */ - bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; - if (bIsVbv) - { - curEncData.m_cuStat[ctuAddr].vbvCost = 0; - curEncData.m_cuStat[ctuAddr].intraVbvCost = 0; - } - - /* Derive qpOffet for each CU by averaging offsets for all 16x16 blocks in the cu. */ - double qp_offset = 0; - uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16; - uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16; - uint32_t noOfBlocks = g_maxCUSize / 16; - uint32_t block_y = (ctuAddr / curEncData.m_slice->m_sps->numCuInWidth) * noOfBlocks; - uint32_t block_x = (ctuAddr * noOfBlocks) - block_y * curEncData.m_slice->m_sps->numCuInWidth; - - /* Use cuTree offsets if cuTree enabled and frame is referenced, else use AQ offsets */ - bool isReferenced = IS_REFERENCED(m_frame); - double *qpoffs = (isReferenced && m_param->rc.cuTree) ? m_frame->m_lowres.qpCuTreeOffset : m_frame->m_lowres.qpAqOffset; - - uint32_t cnt = 0, idx = 0; - for (uint32_t h = 0; h < noOfBlocks && block_y < maxBlockRows; h++, block_y++) - { - for (uint32_t w = 0; w < noOfBlocks && (block_x + w) < maxBlockCols; w++) - { - idx = block_x + w + (block_y * maxBlockCols); - if (m_param->rc.aqMode) - qp_offset += qpoffs[idx]; - if (bIsVbv) - { - curEncData.m_cuStat[ctuAddr].vbvCost += m_frame->m_lowres.lowresCostForRc[idx] & LOWRES_COST_MASK; - curEncData.m_cuStat[ctuAddr].intraVbvCost += m_frame->m_lowres.intraCost[idx]; - } - cnt++; - } - } - - qp_offset /= cnt; - qp += qp_offset; - - return x265_clip3(QP_MIN, QP_MAX_MAX, (int)(qp + 0.5)); -} - Frame *FrameEncoder::getEncodedPicture(NALList& output) { if (m_frame)
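The per-CTU cost seeding that used to live in calcQpForCu() (deleted above) now runs inline in the row loop. A hedged sketch of the 16x16 lowres-block mapping it performs, with illustrative names and the same right/bottom edge clipping:

    uint32_t blockCols    = (picWidth  + 15) / 16;   /* lowres blocks per row */
    uint32_t blockRows    = (picHeight + 15) / 16;
    uint32_t blocksPerCtu = maxCUSize / 16;
    uint32_t by = (cuAddr / numCuInWidth) * blocksPerCtu;
    uint32_t bx = (cuAddr % numCuInWidth) * blocksPerCtu;

    uint64_t vbvCost = 0, intraVbvCost = 0;
    for (uint32_t h = 0; h < blocksPerCtu && by + h < blockRows; h++)
        for (uint32_t w = 0; w < blocksPerCtu && bx + w < blockCols; w++)
        {
            uint32_t idx  = (by + h) * blockCols + (bx + w);
            vbvCost      += lowresCostForRc[idx] & LOWRES_COST_MASK;
            intraVbvCost += intraCost[idx];
        }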
View file
x265_1.6.tar.gz/source/encoder/frameencoder.h -> x265_1.7.tar.gz/source/encoder/frameencoder.h
Changed
@@ -63,11 +63,6 @@ uint64_t cntTotalCu[4]; uint64_t totalCu; - /* These states store the count of inter,intra and skip ctus within quad tree structure of each CU */ - uint32_t qTreeInterCnt[4]; - uint32_t qTreeIntraCnt[4]; - uint32_t qTreeSkipCnt[4]; - StatisticLog() { memset(this, 0, sizeof(StatisticLog)); @@ -226,8 +221,8 @@ void encodeSlice(); void threadMain(); - int calcQpForCu(uint32_t cuAddr, double baseQp); - void collectCTUStatistics(CUData& ctu); + int collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt); + int calcCTUQP(const CUData& ctu); void noiseReductionUpdate(); /* Called by WaveFront::findJob() */
View file
x265_1.6.tar.gz/source/encoder/level.cpp -> x265_1.7.tar.gz/source/encoder/level.cpp
Changed
@@ -55,15 +55,14 @@ { 35651584, 1069547520, 60000, 240000, 60000, 240000, 8, Level::LEVEL6, "6", 60 }, { 35651584, 2139095040, 120000, 480000, 120000, 480000, 8, Level::LEVEL6_1, "6.1", 61 }, { 35651584, 4278190080U, 240000, 800000, 240000, 800000, 6, Level::LEVEL6_2, "6.2", 62 }, + { MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, 1, Level::LEVEL8_5, "8.5", 85 }, }; /* determine minimum decoder level required to decode the described video */ void determineLevel(const x265_param ¶m, VPS& vps) { vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1; - if (param.bLossless) - vps.ptl.profileIdc = Profile::NONE; - else if (param.internalCsp == X265_CSP_I420) + if (param.internalCsp == X265_CSP_I420) { if (param.internalBitDepth == 8) { @@ -104,7 +103,15 @@ const size_t NumLevels = sizeof(levels) / sizeof(levels[0]); uint32_t i; - for (i = 0; i < NumLevels; i++) + if (param.bLossless) + { + i = 13; + vps.ptl.minCrForLevel = 1; + vps.ptl.maxLumaSrForLevel = MAX_UINT; + vps.ptl.levelIdc = Level::LEVEL8_5; + vps.ptl.tierFlag = Level::MAIN; + } + else for (i = 0; i < NumLevels; i++) { if (lumaSamples > levels[i].maxLumaSamples) continue; @@ -337,31 +344,40 @@ extern "C" int x265_param_apply_profile(x265_param *param, const char *profile) { - if (!profile) + if (!param || !profile) return 0; - if (!strcmp(profile, "main")) - { - /* SPSs shall have chroma_format_idc equal to 1 only */ - param->internalCsp = X265_CSP_I420; #if HIGH_BIT_DEPTH - /* SPSs shall have bit_depth_luma_minus8 equal to 0 only */ - x265_log(param, X265_LOG_ERROR, "Main profile not supported, compiled for Main10.\n"); + if (!strcmp(profile, "main") || !strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp") || !strcmp(profile, "main444-8")) + { + x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main10.\n", profile); return -1; -#endif } - else if (!strcmp(profile, "main10")) +#else + if (!strcmp(profile, "main10") || !strcmp(profile, "main422-10") || !strcmp(profile, "main444-10")) { - /* SPSs shall have chroma_format_idc equal to 1 only */ - param->internalCsp = X265_CSP_I420; - - /* SPSs shall have bit_depth_luma_minus8 in the range of 0 to 2, inclusive - * this covers all builds of x265, currently */ + x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main.\n", profile); + return -1; + } +#endif + + if (!strcmp(profile, "main")) + { + if (!(param->internalCsp & X265_CSP_I420)) + { + x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", + profile, x265_source_csp_names[param->internalCsp]); + return -1; + } } else if (!strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp")) { - /* SPSs shall have chroma_format_idc equal to 1 only */ - param->internalCsp = X265_CSP_I420; + if (!(param->internalCsp & X265_CSP_I420)) + { + x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", + profile, x265_source_csp_names[param->internalCsp]); + return -1; + } /* SPSs shall have sps_max_dec_pic_buffering_minus1[ sps_max_sub_layers_minus1 ] equal to 0 only */ param->maxNumReferences = 1; @@ -378,25 +394,29 @@ param->rc.cuTree = 0; param->bEnableWeightedPred = 0; param->bEnableWeightedBiPred = 0; - -#if HIGH_BIT_DEPTH - /* SPSs shall have bit_depth_luma_minus8 equal to 0 only */ - x265_log(param, X265_LOG_ERROR, "Mainstillpicture profile not supported, compiled for Main10.\n"); - return -1; -#endif + } + else if (!strcmp(profile, "main10")) + { + if (!(param->internalCsp & X265_CSP_I420)) + { + 
x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", + profile, x265_source_csp_names[param->internalCsp]); + return -1; + } } else if (!strcmp(profile, "main422-10")) - param->internalCsp = X265_CSP_I422; - else if (!strcmp(profile, "main444-8")) { - param->internalCsp = X265_CSP_I444; -#if HIGH_BIT_DEPTH - x265_log(param, X265_LOG_ERROR, "Main 4:4:4 8 profile not supported, compiled for Main10.\n"); - return -1; -#endif + if (!(param->internalCsp & (X265_CSP_I420 | X265_CSP_I422))) + { + x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", + profile, x265_source_csp_names[param->internalCsp]); + return -1; + } + } + else if (!strcmp(profile, "main444-8") || !strcmp(profile, "main444-10")) + { + /* any color space allowed */ } - else if (!strcmp(profile, "main444-10")) - param->internalCsp = X265_CSP_I444; else { x265_log(param, X265_LOG_ERROR, "unknown profile <%s>\n", profile);
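A hedged sketch of the new x265_param_apply_profile() contract above: profiles now validate the configured color space instead of silently overwriting it (public API calls only; the 4:2:2 value is just an example):

    x265_param* p = x265_param_alloc();
    x265_param_default_preset(p, "medium", NULL);
    p->internalCsp = X265_CSP_I422;               /* 4:2:2 input */
    if (x265_param_apply_profile(p, "main") < 0)  /* fails rather than forcing I420 */
        fprintf(stderr, "Main profile requires 4:2:0 input\n");
    x265_param_free(p);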
View file
x265_1.6.tar.gz/source/encoder/motion.cpp -> x265_1.7.tar.gz/source/encoder/motion.cpp
Changed
@@ -234,9 +234,14 @@ pix_base + (m1x) + (m1y) * stride, \ pix_base + (m2x) + (m2y) * stride, \ stride, costs); \ - (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \ - (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \ - (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \ + const uint16_t *base_mvx = &m_cost_mvx[(bmv.x + (m0x)) << 2]; \ + const uint16_t *base_mvy = &m_cost_mvy[(bmv.y + (m0y)) << 2]; \ + X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]), "mvcost() check failure\n"); \ + X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]), "mvcost() check failure\n"); \ + X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]), "mvcost() check failure\n"); \ + (costs)[0] += (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]); \ + (costs)[1] += (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]); \ + (costs)[2] += (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]); \ } #define COST_MV_PT_DIST_X4(m0x, m0y, p0, d0, m1x, m1y, p1, d1, m2x, m2y, p2, d2, m3x, m3y, p3, d3) \ @@ -247,10 +252,10 @@ fref + (m2x) + (m2y) * stride, \ fref + (m3x) + (m3y) * stride, \ stride, costs); \ - costs[0] += mvcost(MV(m0x, m0y) << 2); \ - costs[1] += mvcost(MV(m1x, m1y) << 2); \ - costs[2] += mvcost(MV(m2x, m2y) << 2); \ - costs[3] += mvcost(MV(m3x, m3y) << 2); \ + (costs)[0] += mvcost(MV(m0x, m0y) << 2); \ + (costs)[1] += mvcost(MV(m1x, m1y) << 2); \ + (costs)[2] += mvcost(MV(m2x, m2y) << 2); \ + (costs)[3] += mvcost(MV(m3x, m3y) << 2); \ COPY4_IF_LT(bcost, costs[0], bmv, MV(m0x, m0y), bPointNr, p0, bDistance, d0); \ COPY4_IF_LT(bcost, costs[1], bmv, MV(m1x, m1y), bPointNr, p1, bDistance, d1); \ COPY4_IF_LT(bcost, costs[2], bmv, MV(m2x, m2y), bPointNr, p2, bDistance, d2); \ @@ -266,10 +271,16 @@ pix_base + (m2x) + (m2y) * stride, \ pix_base + (m3x) + (m3y) * stride, \ stride, costs); \ - costs[0] += mvcost((omv + MV(m0x, m0y)) << 2); \ - costs[1] += mvcost((omv + MV(m1x, m1y)) << 2); \ - costs[2] += mvcost((omv + MV(m2x, m2y)) << 2); \ - costs[3] += mvcost((omv + MV(m3x, m3y)) << 2); \ + const uint16_t *base_mvx = &m_cost_mvx[(omv.x << 2)]; \ + const uint16_t *base_mvy = &m_cost_mvy[(omv.y << 2)]; \ + X265_CHECK(mvcost((omv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \ + X265_CHECK(mvcost((omv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \ + X265_CHECK(mvcost((omv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \ + X265_CHECK(mvcost((omv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \ + costs[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \ + costs[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \ + costs[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \ + costs[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \ COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \ COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \ COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \ @@ -285,10 +296,17 @@ pix_base + (m2x) + (m2y) * stride, \ pix_base + (m3x) + (m3y) * stride, \ stride, costs); \ - (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \ - (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \ - (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \ - 
(costs)[3] += mvcost((bmv + MV(m3x, m3y)) << 2); \ + /* TODO: use restrict keyword in ICL */ \ + const uint16_t *base_mvx = &m_cost_mvx[(bmv.x << 2)]; \ + const uint16_t *base_mvy = &m_cost_mvy[(bmv.y << 2)]; \ + X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \ + X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \ + X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \ + X265_CHECK(mvcost((bmv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \ + (costs)[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \ + (costs)[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \ + (costs)[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \ + (costs)[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \ } #define DIA1_ITER(mx, my) \
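The macro rewrite above trades repeated mvcost() calls for two pre-biased table reads per candidate. A hedged note on the decomposition it relies on (helper illustrative; the tables are indexed in quarter-pel units and biased so negative full-pel offsets are valid):

    /* mvcost(mv) splits into independent x and y lookups, so the base
     * pointers can be hoisted out of the 3- and 4-candidate checks */
    static inline uint32_t candidateCost(const uint16_t* baseMvx,  /* &m_cost_mvx[bmv.x << 2] */
                                         const uint16_t* baseMvy,  /* &m_cost_mvy[bmv.y << 2] */
                                         int dx, int dy)
    {
        /* equivalent to mvcost((bmv + MV(dx, dy)) << 2) in the old macros */
        return (uint32_t)baseMvx[dx << 2] + baseMvy[dy << 2];
    }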
View file
x265_1.6.tar.gz/source/encoder/nal.cpp -> x265_1.7.tar.gz/source/encoder/nal.cpp
Changed
@@ -35,6 +35,7 @@ , m_extraBuffer(NULL) , m_extraOccupancy(0) , m_extraAllocSize(0) + , m_annexB(true) {} void NALList::takeContents(NALList& other) @@ -90,7 +91,12 @@ uint8_t *out = m_buffer + m_occupancy; uint32_t bytes = 0; - if (!m_numNal || nalUnitType == NAL_UNIT_VPS || nalUnitType == NAL_UNIT_SPS || nalUnitType == NAL_UNIT_PPS) + if (!m_annexB) + { + /* Will write size later */ + bytes += 4; + } + else if (!m_numNal || nalUnitType == NAL_UNIT_VPS || nalUnitType == NAL_UNIT_SPS || nalUnitType == NAL_UNIT_PPS) { memcpy(out, startCodePrefix, 4); bytes += 4; @@ -144,6 +150,16 @@ * to 0x03 is appended to the end of the data. */ if (!out[bytes - 1]) out[bytes++] = 0x03; + + if (!m_annexB) + { + uint32_t dataSize = bytes - 4; + out[0] = (uint8_t)(dataSize >> 24); + out[1] = (uint8_t)(dataSize >> 16); + out[2] = (uint8_t)(dataSize >> 8); + out[3] = (uint8_t)dataSize; + } + m_occupancy += bytes; X265_CHECK(m_numNal < (uint32_t)MAX_NAL_UNITS, "NAL count overflow\n");
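With the serialize() change above, a consumer that sets x265_param.bAnnexB to 0 receives each NAL behind a 4-byte big-endian length prefix instead of a start code. A hedged sketch of walking a buffer of concatenated prefixed units (consumeNal() is a hypothetical muxer hook):

    const uint8_t* p   = payload;
    const uint8_t* end = payload + sizeBytes;
    while (p + 4 <= end)
    {
        uint32_t len = ((uint32_t)p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
        p += 4;
        if (p + len > end)
            break;                          /* truncated unit */
        consumeNal(p, len);                 /* hypothetical callback */
        p += len;
    }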
View file
x265_1.6.tar.gz/source/encoder/nal.h -> x265_1.7.tar.gz/source/encoder/nal.h
Changed
@@ -48,6 +48,7 @@ uint8_t* m_extraBuffer; uint32_t m_extraOccupancy; uint32_t m_extraAllocSize; + bool m_annexB; NALList(); ~NALList() { X265_FREE(m_buffer); X265_FREE(m_extraBuffer); }
View file
x265_1.6.tar.gz/source/encoder/ratecontrol.cpp -> x265_1.7.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -300,7 +300,7 @@ } } - /* qstep - value set as encoder specific */ + /* qpstep - value set as encoder specific */ m_lstep = pow(2, m_param->rc.qpStep / 6.0); for (int i = 0; i < 2; i++) @@ -370,14 +370,19 @@ m_accumPQp = (m_param->rc.rateControlMode == X265_RC_CRF ? CRF_INIT_QP : ABR_INIT_QP_MIN) * m_accumPNorm; /* Frame Predictors and Row predictors used in vbv */ - for (int i = 0; i < 5; i++) + for (int i = 0; i < 4; i++) { - m_pred[i].coeff = 1.5; + m_pred[i].coeff = 1.0; m_pred[i].count = 1.0; m_pred[i].decay = 0.5; m_pred[i].offset = 0.0; } - m_pred[0].coeff = 1.0; + m_pred[0].coeff = m_pred[3].coeff = 0.75; + if (m_param->rc.qCompress >= 0.8) // when tuned for grain + { + m_pred[1].coeff = 0.75; + m_pred[0].coeff = m_pred[3].coeff = 0.50; + } if (!m_statFileOut && (m_param->rc.bStatWrite || m_param->rc.bStatRead)) { /* If the user hasn't defined the stat filename, use the default value */ @@ -945,6 +950,9 @@ m_curSlice = curEncData.m_slice; m_sliceType = m_curSlice->m_sliceType; rce->sliceType = m_sliceType; + if (!m_2pass) + rce->keptAsRef = IS_REFERENCED(curFrame); + m_predType = getPredictorType(curFrame->m_lowres.sliceType, m_sliceType); rce->poc = m_curSlice->m_poc; if (m_param->rc.bStatRead) { @@ -1074,7 +1082,7 @@ m_lastQScaleFor[m_sliceType] = x265_qp2qScale(rce->qpaRc); if (rce->poc == 0) m_lastQScaleFor[P_SLICE] = m_lastQScaleFor[m_sliceType] * fabs(m_param->rc.ipFactor); - rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], m_qp, (double)m_currentSatd); + rce->frameSizePlanned = predictSize(&m_pred[m_predType], m_qp, (double)m_currentSatd); } } m_framesDone++; @@ -1105,6 +1113,14 @@ m_accumPQp += m_qp; } +int RateControl::getPredictorType(int lowresSliceType, int sliceType) +{ + /* Use a different predictor for B Ref and B frames for vbv frame size predictions */ + if (lowresSliceType == X265_TYPE_BREF) + return 3; + return sliceType; +} + double RateControl::getDiffLimitedQScale(RateControlEntry *rce, double q) { // force I/B quants as a function of P quants @@ -1379,6 +1395,7 @@ q += m_pbOffset; double qScale = x265_qp2qScale(q); + rce->qpNoVbv = q; double lmin = 0, lmax = 0; if (m_isVbv) { @@ -1391,16 +1408,15 @@ qScale = x265_clip3(lmin, lmax, qScale); q = x265_qScale2qp(qScale); } - rce->qpNoVbv = q; if (!m_2pass) { qScale = clipQscale(curFrame, rce, qScale); /* clip qp to permissible range after vbv-lookahead estimation to avoid possible * mispredictions by initial frame size predictors */ - if (m_pred[m_sliceType].count == 1) + if (m_pred[m_predType].count == 1) qScale = x265_clip3(lmin, lmax, qScale); m_lastQScaleFor[m_sliceType] = qScale; - rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], qScale, (double)m_currentSatd); + rce->frameSizePlanned = predictSize(&m_pred[m_predType], qScale, (double)m_currentSatd); } else rce->frameSizePlanned = qScale2bits(rce, qScale); @@ -1544,7 +1560,7 @@ q = clipQscale(curFrame, rce, q); /* clip qp to permissible range after vbv-lookahead estimation to avoid possible * mispredictions by initial frame size predictors */ - if (!m_2pass && m_isVbv && m_pred[m_sliceType].count == 1) + if (!m_2pass && m_isVbv && m_pred[m_predType].count == 1) q = x265_clip3(lqmin, lqmax, q); } m_lastQScaleFor[m_sliceType] = q; @@ -1554,7 +1570,7 @@ if (m_2pass && m_isVbv) rce->frameSizePlanned = qScale2bits(rce, q); else - rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd); + rce->frameSizePlanned = predictSize(&m_pred[m_predType], q, (double)m_currentSatd); /* Always use up the whole VBV in this case. 
*/ if (m_singleFrameVbv) @@ -1707,7 +1723,7 @@ { double frameQ[3]; double curBits; - curBits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd); + curBits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd); double bufferFillCur = m_bufferFill - curBits; double targetFill; double totalDuration = m_frameDuration; @@ -1726,7 +1742,8 @@ bufferFillCur += wantedFrameSize; int64_t satd = curFrame->m_lowres.plannedSatd[j] >> (X265_DEPTH - 8); type = IS_X265_TYPE_I(type) ? I_SLICE : IS_X265_TYPE_B(type) ? B_SLICE : P_SLICE; - curBits = predictSize(&m_pred[type], frameQ[type], (double)satd); + int predType = getPredictorType(curFrame->m_lowres.plannedType[j], type); + curBits = predictSize(&m_pred[predType], frameQ[type], (double)satd); bufferFillCur -= curBits; } @@ -1766,7 +1783,7 @@ } // Now a hard threshold to make sure the frame fits in VBV. // This one is mostly for I-frames. - double bits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd); + double bits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd); // For small VBVs, allow the frame to use up the entire VBV. double maxFillFactor; @@ -1783,18 +1800,21 @@ bits *= qf; if (bits < m_bufferRate / minFillFactor) q *= bits * minFillFactor / m_bufferRate; - bits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd); + bits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd); } q = X265_MAX(q0, q); } /* Apply MinCR restrictions */ - double pbits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd); + double pbits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd); if (pbits > rce->frameSizeMaximum) q *= pbits / rce->frameSizeMaximum; - - if (!m_isCbr || (m_isAbr && m_currentSatd >= rce->movingAvgSum && q <= q0 / 2)) + /* To detect frames that are more complex in SATD costs compared to prev window, yet + * lookahead vbv reduces its qscale by half its value. 
Be on safer side and avoid drastic + * qscale reductions for frames high in complexity */ + bool mispredCheck = rce->movingAvgSum && m_currentSatd >= rce->movingAvgSum && q <= q0 / 2; + if (!m_isCbr || (m_isAbr && mispredCheck)) q = X265_MAX(q0, q); if (m_rateFactorMaxIncrement) @@ -1838,18 +1858,26 @@ if (satdCostForPendingCus > 0) { double pred_s = predictSize(rce->rowPred[0], qScale, satdCostForPendingCus); - uint32_t refRowSatdCost = 0, refRowBits = 0, intraCost = 0; + uint32_t refRowSatdCost = 0, refRowBits = 0, intraCostForPendingCus = 0; double refQScale = 0; if (picType != I_SLICE) { FrameData& refEncData = *refFrame->m_encData; uint32_t endCuAddr = maxCols * (row + 1); - for (uint32_t cuAddr = curEncData.m_rowStat[row].numEncodedCUs + 1; cuAddr < endCuAddr; cuAddr++) + uint32_t startCuAddr = curEncData.m_rowStat[row].numEncodedCUs; + if (startCuAddr) { - refRowSatdCost += refEncData.m_cuStat[cuAddr].vbvCost; - refRowBits += refEncData.m_cuStat[cuAddr].totalBits; - intraCost += curEncData.m_cuStat[cuAddr].intraVbvCost; + for (uint32_t cuAddr = startCuAddr + 1 ; cuAddr < endCuAddr; cuAddr++) + { + refRowSatdCost += refEncData.m_cuStat[cuAddr].vbvCost; + refRowBits += refEncData.m_cuStat[cuAddr].totalBits; + } + } + else + { + refRowBits = refEncData.m_rowStat[row].encodedBits; + refRowSatdCost = refEncData.m_rowStat[row].satdForVbv; } refRowSatdCost >>= X265_DEPTH - 8; @@ -1859,7 +1887,7 @@ if (picType == I_SLICE || qScale >= refQScale) { if (picType == P_SLICE - && !refFrame + && refFrame && refFrame->m_encData->m_slice->m_sliceType == picType && refQScale > 0 && refRowSatdCost > 0) @@ -1875,8 +1903,9 @@ } else if (picType == P_SLICE) { + intraCostForPendingCus = curEncData.m_rowStat[row].intraSatdForVbv - curEncData.m_rowStat[row].diagIntraSatd; /* Our QP is lower than the reference! */ - double pred_intra = predictSize(rce->rowPred[1], qScale, intraCost); + double pred_intra = predictSize(rce->rowPred[1], qScale, intraCostForPendingCus); /* Sum: better to overestimate than underestimate by using only one of the two predictors. */ totalSatdBits += (int32_t)(pred_intra + pred_s); } @@ -2099,8 +2128,10 @@ void RateControl::updateVbv(int64_t bits, RateControlEntry* rce) { + int predType = rce->sliceType; + predType = rce->sliceType == B_SLICE && rce->keptAsRef ? 3 : predType; if (rce->lastSatd >= m_ncu) - updatePredictor(&m_pred[rce->sliceType], x265_qp2qScale(rce->qpaRc), (double)rce->lastSatd, (double)bits); + updatePredictor(&m_pred[predType], x265_qp2qScale(rce->qpaRc), (double)rce->lastSatd, (double)bits); if (!m_isVbv) return; @@ -2156,23 +2187,24 @@ { if (m_isVbv) { + /* determine avg QP decided by VBV rate control */ for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++) curEncData.m_avgQpRc += curEncData.m_rowStat[i].sumQpRc; curEncData.m_avgQpRc /= slice->m_sps->numCUsInFrame; rce->qpaRc = curEncData.m_avgQpRc; - - // copy avg RC qp to m_avgQpAq. To print out the correct qp when aq/cutree is disabled. - curEncData.m_avgQpAq = curEncData.m_avgQpRc; } if (m_param->rc.aqMode) { + /* determine actual avg encoded QP, after AQ/cutree adjustments */ for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++) curEncData.m_avgQpAq += curEncData.m_rowStat[i].sumQpAq; - curEncData.m_avgQpAq /= slice->m_sps->numCUsInFrame; + curEncData.m_avgQpAq /= (slice->m_sps->numCUsInFrame * NUM_4x4_PARTITIONS); } + else + curEncData.m_avgQpAq = curEncData.m_avgQpRc; } // Write frame stats into the stats file if 2 pass is enabled. 
@@ -2301,7 +2333,7 @@ { m_finalFrameCount = count; /* unblock waiting threads */ - m_startEndOrder.set(m_startEndOrder.get()); + m_startEndOrder.poke(); } /* called when the encoder is closing, and no more frames will be output. @@ -2311,7 +2343,7 @@ { m_bTerminated = true; /* unblock waiting threads */ - m_startEndOrder.set(m_startEndOrder.get()); + m_startEndOrder.poke(); } void RateControl::destroy()
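The net effect of the m_predType plumbing above, restated as a hedged one-liner: referenced B frames (B-refs) get the dedicated fourth frame-size predictor instead of sharing B_SLICE's, which is what getPredictorType() and the updateVbv() change both implement:

    /* indices 0..2 follow sliceType (B/P/I); index 3 is the new B-ref slot */
    int predType = (lowresSliceType == X265_TYPE_BREF) ? 3 : sliceType;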
View file
x265_1.6.tar.gz/source/encoder/ratecontrol.h -> x265_1.7.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -157,10 +157,9 @@ double m_rateFactorMaxIncrement; /* Don't allow RF above (CRF + this value). */ double m_rateFactorMaxDecrement; /* don't allow RF below (this value). */ - Predictor m_pred[5]; - Predictor m_predBfromP; - + Predictor m_pred[4]; /* Slice predictors to predict bits for each Slice type - I,P,Bref and B */ int64_t m_leadingNoBSatd; + int m_predType; /* Type of slice predictors to be used - depends on the slice type */ double m_ipOffset; double m_pbOffset; int64_t m_bframeBits; @@ -266,6 +265,7 @@ double tuneAbrQScaleFromFeedback(double qScale); void accumPQpUpdate(); + int getPredictorType(int lowresSliceType, int sliceType); void updateVbv(int64_t bits, RateControlEntry* rce); void updatePredictor(Predictor *p, double q, double var, double bits); double clipQscale(Frame* pic, RateControlEntry* rce, double q);
View file
x265_1.6.tar.gz/source/encoder/rdcost.h -> x265_1.7.tar.gz/source/encoder/rdcost.h
Changed
@@ -40,13 +40,15 @@ uint32_t m_chromaDistWeight[2]; uint32_t m_psyRdBase; uint32_t m_psyRd; - int m_qp; + int m_qp; /* QP used to configure lambda, may be higher than QP_MAX_SPEC but <= QP_MAX_MAX */ void setPsyRdScale(double scale) { m_psyRdBase = (uint32_t)floor(65536.0 * scale * 0.33); } void setQP(const Slice& slice, int qp) { + x265_emms(); /* TODO: if the lambda tables were ints, this would not be necessary */ m_qp = qp; + setLambda(x265_lambda2_tab[qp], x265_lambda_tab[qp]); /* Scale PSY RD factor by a slice type factor */ static const uint32_t psyScaleFix8[3] = { 300, 256, 96 }; /* B, P, I */ @@ -60,19 +62,21 @@ } int qpCb, qpCr; - setLambda(x265_lambda2_tab[qp], x265_lambda_tab[qp]); if (slice.m_sps->chromaFormatIdc == X265_CSP_I420) - qpCb = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice.m_pps->chromaQpOffset[0]]); + { + qpCb = (int)g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, qp + slice.m_pps->chromaQpOffset[0])]; + qpCr = (int)g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, qp + slice.m_pps->chromaQpOffset[1])]; + } else - qpCb = X265_MIN(qp + slice.m_pps->chromaQpOffset[0], QP_MAX_SPEC); + { + qpCb = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[0]); + qpCr = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[1]); + } + int chroma_offset_idx = X265_MIN(qp - qpCb + 12, MAX_CHROMA_LAMBDA_OFFSET); uint16_t lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256; m_chromaDistWeight[0] = lambdaOffset; - if (slice.m_sps->chromaFormatIdc == X265_CSP_I420) - qpCr = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice.m_pps->chromaQpOffset[0]]); - else - qpCr = X265_MIN(qp + slice.m_pps->chromaQpOffset[0], QP_MAX_SPEC); chroma_offset_idx = X265_MIN(qp - qpCr + 12, MAX_CHROMA_LAMBDA_OFFSET); lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256; m_chromaDistWeight[1] = lambdaOffset;
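A hedged restatement of the corrected chroma QP derivation above (the old code indexed g_chromaScale[] with an unclipped QP and reused chromaQpOffset[0] for Cr; both are fixed in the diff):

    static inline int chromaQp(int lumaQp, int qpOffset, int csp)
    {
        if (csp == X265_CSP_I420)   /* clip first, then map through the table */
            return g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, lumaQp + qpOffset)];
        return x265_clip3(QP_MIN, QP_MAX_SPEC, lumaQp + qpOffset);
    }
    /* qpCb = chromaQp(qp, slice.m_pps->chromaQpOffset[0], csp);
       qpCr = chromaQp(qp, slice.m_pps->chromaQpOffset[1], csp); */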
x265_1.6.tar.gz/source/encoder/sao.cpp -> x265_1.7.tar.gz/source/encoder/sao.cpp
Changed
@@ -258,7 +258,7 @@ pixel* tmpL; pixel* tmpU; - int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1; + int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1, signLeft1[2]; int8_t _upBufft[MAX_CU_SIZE + 2], *upBufft = _upBufft + 1; memset(_upBuff1 + MAX_CU_SIZE, 0, 2 * sizeof(int8_t)); /* avoid valgrind uninit warnings */ @@ -279,7 +279,7 @@ { case SAO_EO_0: // dir: - { - pixel firstPxl = 0, lastPxl = 0; + pixel firstPxl = 0, lastPxl = 0, row1FirstPxl = 0, row1LastPxl = 0; startX = !lpelx; endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth; if (ctuWidth & 15) @@ -301,25 +301,38 @@ } else { - for (y = 0; y < ctuHeight; y++) + for (y = 0; y < ctuHeight; y += 2) { - int signLeft = signOf(rec[startX] - tmpL[y]); + signLeft1[0] = signOf(rec[startX] - tmpL[y]); + signLeft1[1] = signOf(rec[stride + startX] - tmpL[y + 1]); if (!lpelx) + { firstPxl = rec[0]; + row1FirstPxl = rec[stride]; + } if (rpelx == picWidth) + { lastPxl = rec[ctuWidth - 1]; + row1LastPxl = rec[stride + ctuWidth - 1]; + } - primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, (int8_t)signLeft); + primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, signLeft1, stride); if (!lpelx) + { rec[0] = firstPxl; + rec[stride] = row1FirstPxl; + } if (rpelx == picWidth) + { rec[ctuWidth - 1] = lastPxl; + rec[stride + ctuWidth - 1] = row1LastPxl; + } - rec += stride; + rec += 2 * stride; } } break; @@ -354,11 +367,14 @@ { primitives.sign(upBuff1, rec, tmpU, ctuWidth); - for (y = startY; y < endY; y++) + int diff = (endY - startY) % 2; + for (y = startY; y < endY - diff; y += 2) { - primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth); - rec += stride; + primitives.saoCuOrgE1_2Rows(rec, upBuff1, m_offsetEo, stride, ctuWidth); + rec += 2 * stride; } + if (diff & 1) + primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth); } break; @@ -421,23 +437,8 @@ for (y = startY; y < endY; y++) { int8_t iSignDown2 = signOf(rec[stride + startX] - tmpL[y]); - pixel firstPxl = rec[0]; // copy first Pxl - pixel lastPxl = rec[ctuWidth - 1]; - int8_t one = upBufft[1]; - int8_t two = upBufft[endX + 1]; - primitives.saoCuOrgE2(rec, upBufft, upBuff1, m_offsetEo, ctuWidth, stride); - if (!lpelx) - { - rec[0] = firstPxl; - upBufft[1] = one; - } - - if (rpelx == picWidth) - { - rec[ctuWidth - 1] = lastPxl; - upBufft[endX + 1] = two; - } + primitives.saoCuOrgE2[endX > 16](rec + startX, upBufft + startX, upBuff1 + startX, m_offsetEo, endX - startX, stride); upBufft[startX] = iSignDown2; @@ -508,7 +509,7 @@ upBuff1[x - 1] = -signDown; rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]]; - primitives.saoCuOrgE3(rec, upBuff1, m_offsetEo, stride - 1, startX, endX); + primitives.saoCuOrgE3[endX > 16](rec, upBuff1, m_offsetEo, stride - 1, startX, endX); upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]); @@ -783,13 +784,7 @@ rec += stride; } - if (!(ctuWidth & 15)) - primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth); - else - { - for (x = 0; x < ctuWidth; x++) - upBuff1[x] = signOf(rec[x] - rec[x - stride]); - } + primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth); for (y = startY; y < endY; y++) { @@ -832,8 +827,7 @@ rec += stride; } - for (x = startX; x < endX; x++) - upBuff1[x] = signOf(rec[x] - rec[x - stride - 1]); + primitives.sign(&upBuff1[startX], &rec[startX], &rec[startX - stride - 1], (endX - startX)); for (y = startY; y < endY; y++) { @@ -879,8 +873,7 @@ rec += stride; } - for (x = startX - 1; x < endX; x++) - upBuff1[x] = signOf(rec[x] - rec[x - stride + 1]); + primitives.sign(&upBuff1[startX - 1], &rec[startX 
- 1], &rec[startX - 1 - stride + 1], (endX - startX + 1)); for (y = startY; y < endY; y++) {
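Several of the sao.cpp hunks replace open-coded signOf() loops with the vectorizable sign primitive, which is now also used for widths that are not a multiple of 16. Its scalar reference is tiny; this sketch assumes the 8-bit build's pixel type:

#include <cstdint>

typedef uint8_t pixel; /* 8-bit build assumed */

/* branchless sign: -1 if x < 0, +1 if x > 0, 0 otherwise */
static inline int8_t signOf(int x)
{
    return (int8_t)((x >> 31) | ((int)(((uint32_t)-x) >> 31)));
}

/* dst[i] = sign(src0[i] - src1[i]) for i in [0, count) */
void sign_c(int8_t* dst, const pixel* src0, const pixel* src1, int count)
{
    for (int i = 0; i < count; i++)
        dst[i] = signOf(src0[i] - src1[i]);
}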
x265_1.6.tar.gz/source/encoder/search.cpp -> x265_1.7.tar.gz/source/encoder/search.cpp
Changed
@@ -163,11 +163,16 @@ X265_FREE(m_tsRecon); } -void Search::setQP(const Slice& slice, int qp) +int Search::setLambdaFromQP(const CUData& ctu, int qp) { - x265_emms(); /* TODO: if the lambda tables were ints, this would not be necessary */ + X265_CHECK(qp >= QP_MIN && qp <= QP_MAX_MAX, "QP used for lambda is out of range\n"); + m_me.setQP(qp); - m_rdCost.setQP(slice, qp); + m_rdCost.setQP(*m_slice, qp); + + int quantQP = x265_clip3(QP_MIN, QP_MAX_SPEC, qp); + m_quant.setQPforQuant(ctu, quantQP); + return quantQP; } #if CHECKED_BUILD || _DEBUG @@ -1185,7 +1190,7 @@ intraMode.psyEnergy = m_rdCost.psyCost(cuGeom.log2CUSize - 2, fencYuv->m_buf[0], fencYuv->m_size, intraMode.reconYuv.m_buf[0], intraMode.reconYuv.m_size); } updateModeCost(intraMode); - checkDQP(cu, cuGeom); + checkDQP(intraMode, cuGeom); } /* Note that this function does not save the best intra prediction, it must @@ -1231,16 +1236,11 @@ pixel nScale[129]; intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0]; - primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1, 0); + primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1); // we do not estimate filtering for downscaled samples - for (int x = 1; x < 65; x++) - { - intraNeighbourBuf[0][x] = nScale[x]; // Top pixel - intraNeighbourBuf[0][x + 64] = nScale[x + 64]; // Left pixel - intraNeighbourBuf[1][x] = nScale[x]; // Top pixel - intraNeighbourBuf[1][x + 64] = nScale[x + 64]; // Left pixel - } + memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel)); // Top & Left pixels + memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel)); scaleTuSize = 32; scaleStride = 32; @@ -1369,8 +1369,6 @@ X265_CHECK(cu.m_partSize[0] == SIZE_2Nx2N, "encodeIntraInInter does not expect NxN intra\n"); X265_CHECK(!m_slice->isIntra(), "encodeIntraInInter does not expect to be used in I slices\n"); - m_quant.setQPforQuant(cu); - uint32_t tuDepthRange[2]; cu.getIntraTUQtDepthRange(tuDepthRange, 0); @@ -1405,7 +1403,7 @@ m_entropyCoder.store(intraMode.contexts); updateModeCost(intraMode); - checkDQP(intraMode.cu, cuGeom); + checkDQP(intraMode, cuGeom); } uint32_t Search::estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2], uint8_t* sharedModes) @@ -1465,16 +1463,10 @@ pixel nScale[129]; intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0]; - primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1, 0); + primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1); - // TO DO: primitive - for (int x = 1; x < 65; x++) - { - intraNeighbourBuf[0][x] = nScale[x]; // Top pixel - intraNeighbourBuf[0][x + 64] = nScale[x + 64]; // Left pixel - intraNeighbourBuf[1][x] = nScale[x]; // Top pixel - intraNeighbourBuf[1][x + 64] = nScale[x + 64]; // Left pixel - } + memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel)); + memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel)); scaleTuSize = 32; scaleStride = 32; @@ -1869,6 +1861,34 @@ return outCost; } +/* Pick between the two AMVP candidates which is the best one to use as + * MVP for the motion search, based on SAD cost */ +int Search::selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref) +{ + if (amvp[0] == amvp[1]) + return 0; + + Yuv& tmpPredYuv = m_rqt[cu.m_cuDepth[0]].tmpPredYuv; + uint32_t costs[AMVP_NUM_CANDS]; + + for (int i = 0; i < AMVP_NUM_CANDS; i++) + { + MV mvCand = amvp[i]; + + // NOTE: skip mvCand if Y is > merange and -FN>1 + if (m_bFrameParallel && (mvCand.y >= (m_param->searchRange + 1) * 4)) 
+ costs[i] = m_me.COST_MAX; + else + { + cu.clipMv(mvCand); + predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand); + costs[i] = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size); + } + } + + return costs[0] <= costs[1] ? 0 : 1; +} + void Search::PME::processTasks(int workerThreadId) { #if DETAILED_CU_STATS @@ -1899,10 +1919,10 @@ /* Setup slave Search instance for ME for master's CU */ if (&slave != this) { - slave.setQP(*m_slice, m_rdCost.m_qp); slave.m_slice = m_slice; slave.m_frame = m_frame; - + slave.m_param = m_param; + slave.setLambdaFromQP(pme.mode.cu, m_rdCost.m_qp); slave.m_me.setSourcePU(*pme.mode.fencYuv, pme.pu.ctuAddr, pme.pu.cuAbsPartIdx, pme.pu.puAbsPartIdx, pme.pu.width, pme.pu.height); } @@ -1910,9 +1930,9 @@ do { if (meId < m_slice->m_numRefIdx[0]) - slave.singleMotionEstimation(*this, pme.mode, pme.cuGeom, pme.pu, pme.puIdx, 0, meId); + slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 0, meId); else - slave.singleMotionEstimation(*this, pme.mode, pme.cuGeom, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]); + slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]); meId = -1; pme.m_lock.acquire(); @@ -1923,55 +1943,30 @@ while (meId >= 0); } -void Search::singleMotionEstimation(Search& master, Mode& interMode, const CUGeom& cuGeom, const PredictionUnit& pu, - int part, int list, int ref) +void Search::singleMotionEstimation(Search& master, Mode& interMode, const PredictionUnit& pu, int part, int list, int ref) { uint32_t bits = master.m_listSelBits[list] + MVP_IDX_BITS; bits += getTUBits(ref, m_slice->m_numRefIdx[list]); - MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 1]; - int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc); - - int mvpIdx = 0; - int merange = m_param->searchRange; MotionData* bestME = interMode.bestME[part]; - if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1]) - { - uint32_t bestCost = MAX_INT; - for (int i = 0; i < AMVP_NUM_CANDS; i++) - { - MV mvCand = interMode.amvpCand[list][ref][i]; - - // NOTE: skip mvCand if Y is > merange and -FN>1 - if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4)) - continue; - - interMode.cu.clipMv(mvCand); - - Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv; - predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand); - uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size); + MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 1]; + int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc); - if (bestCost > cost) - { - bestCost = cost; - mvpIdx = i; - } - } - } + const MV* amvp = interMode.amvpCand[list][ref]; + int mvpIdx = selectMVP(interMode.cu, pu, amvp, list, ref); + MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; - MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx]; - setSearchRange(interMode.cu, mvp, merange, mvmin, mvmax); + setSearchRange(interMode.cu, mvp, m_param->searchRange, mvmin, mvmax); - int satdCost = m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv); + int satdCost = m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv); /* Get total cost of partition, but only include MV bit cost once */ bits += m_me.bitcost(outmv); uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits); - /* Refine 
MVP selection, updates: mvp, mvpIdx, bits, cost */ - checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost); + /* Refine MVP selection, updates: mvpIdx, bits, cost */ + mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost); /* tie goes to the smallest ref ID, just like --no-pme */ ScopedLock _lock(master.m_meLock); @@ -1988,7 +1983,7 @@ } /* find the best inter prediction for each PU of specified mode */ -void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bMergeOnly, bool bChromaSA8D) +void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC) { ProfileCUScope(interMode.cu, motionEstimationElapsedTime, countMotionEstimate); @@ -2009,7 +2004,6 @@ Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv; MergeData merge; - uint32_t mrgCost; memset(&merge, 0, sizeof(merge)); for (int puIdx = 0; puIdx < numPart; puIdx++) @@ -2020,27 +2014,7 @@ m_me.setSourcePU(*interMode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height); /* find best cost merge candidate. note: 2Nx2N merge and bidir are handled as separate modes */ - if (cu.m_partSize[0] != SIZE_2Nx2N) - { - mrgCost = mergeEstimation(cu, cuGeom, pu, puIdx, merge); - - if (bMergeOnly && mrgCost != MAX_UINT) - { - cu.m_mergeFlag[pu.puAbsPartIdx] = true; - cu.m_mvpIdx[0][pu.puAbsPartIdx] = merge.index; // merge candidate ID is stored in L0 MVP idx - cu.setPUInterDir(merge.dir, pu.puAbsPartIdx, puIdx); - cu.setPUMv(0, merge.mvField[0].mv, pu.puAbsPartIdx, puIdx); - cu.setPURefIdx(0, merge.mvField[0].refIdx, pu.puAbsPartIdx, puIdx); - cu.setPUMv(1, merge.mvField[1].mv, pu.puAbsPartIdx, puIdx); - cu.setPURefIdx(1, merge.mvField[1].refIdx, pu.puAbsPartIdx, puIdx); - totalmebits += merge.bits; - - motionCompensation(cu, pu, *predYuv, true, bChromaSA8D); - continue; - } - } - else - mrgCost = MAX_UINT; + uint32_t mrgCost = numPart == 1 ? 
MAX_UINT : mergeEstimation(cu, cuGeom, pu, puIdx, merge); bestME[0].cost = MAX_UINT; bestME[1].cost = MAX_UINT; @@ -2061,45 +2035,19 @@ int numMvc = cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc); - // Pick the best possible MVP from AMVP candidates based on least residual - int mvpIdx = 0; - int merange = m_param->searchRange; + const MV* amvp = interMode.amvpCand[list][ref]; + int mvpIdx = selectMVP(cu, pu, amvp, list, ref); + MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; - if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1]) - { - uint32_t bestCost = MAX_INT; - for (int i = 0; i < AMVP_NUM_CANDS; i++) - { - MV mvCand = interMode.amvpCand[list][ref][i]; - - // NOTE: skip mvCand if Y is > merange and -FN>1 - if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4)) - continue; - - cu.clipMv(mvCand); - predInterLumaPixel(pu, tmpPredYuv, *slice->m_refPicList[list][ref]->m_reconPic, mvCand); - uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size); - - if (bestCost > cost) - { - bestCost = cost; - mvpIdx = i; - } - } - } - - MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx]; - - int satdCost; - setSearchRange(cu, mvp, merange, mvmin, mvmax); - satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv); + setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax); + int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv); /* Get total cost of partition, but only include MV bit cost once */ bits += m_me.bitcost(outmv); uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits); - /* Refine MVP selection, updates: mvp, mvpIdx, bits, cost */ - checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost); + /* Refine MVP selection, updates: mvpIdx, bits, cost */ + mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost); if (cost < bestME[list].cost) { @@ -2122,7 +2070,7 @@ { processPME(pme, *this); - singleMotionEstimation(*this, interMode, cuGeom, pu, puIdx, 0, 0); /* L0-0 */ + singleMotionEstimation(*this, interMode, pu, puIdx, 0, 0); /* L0-0 */ bDoUnidir = false; @@ -2144,44 +2092,19 @@ int numMvc = cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc); - // Pick the best possible MVP from AMVP candidates based on least residual - int mvpIdx = 0; - int merange = m_param->searchRange; - - if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1]) - { - uint32_t bestCost = MAX_INT; - for (int i = 0; i < AMVP_NUM_CANDS; i++) - { - MV mvCand = interMode.amvpCand[list][ref][i]; - - // NOTE: skip mvCand if Y is > merange and -FN>1 - if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4)) - continue; + const MV* amvp = interMode.amvpCand[list][ref]; + int mvpIdx = selectMVP(cu, pu, amvp, list, ref); + MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; - cu.clipMv(mvCand); - predInterLumaPixel(pu, tmpPredYuv, *slice->m_refPicList[list][ref]->m_reconPic, mvCand); - uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size); - - if (bestCost > cost) - { - bestCost = cost; - mvpIdx = i; - } - } - } - - MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx]; - - setSearchRange(cu, mvp, merange, mvmin, mvmax); - int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv); + setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax); + 
int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv); /* Get total cost of partition, but only include MV bit cost once */ bits += m_me.bitcost(outmv); uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits); - /* Refine MVP selection, updates: mvp, mvpIdx, bits, cost */ - checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost); + /* Refine MVP selection, updates: mvpIdx, bits, cost */ + mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost); if (cost < bestME[list].cost) { @@ -2289,8 +2212,8 @@ uint32_t cost = satdCost + m_rdCost.getCost(bits0) + m_rdCost.getCost(bits1); /* refine MVP selection for zero mv, updates: mvp, mvpidx, bits, cost */ - checkBestMVP(interMode.amvpCand[0][bestME[0].ref], mvzero, mvp0, mvpIdx0, bits0, cost); - checkBestMVP(interMode.amvpCand[1][bestME[1].ref], mvzero, mvp1, mvpIdx1, bits1, cost); + mvp0 = checkBestMVP(interMode.amvpCand[0][bestME[0].ref], mvzero, mvpIdx0, bits0, cost); + mvp1 = checkBestMVP(interMode.amvpCand[1][bestME[1].ref], mvzero, mvpIdx1, bits1, cost); if (cost < bidirCost) { @@ -2370,7 +2293,7 @@ totalmebits += bestME[1].bits; } - motionCompensation(cu, pu, *predYuv, true, bChromaSA8D); + motionCompensation(cu, pu, *predYuv, true, bChromaMC); } X265_CHECK(interMode.ok(), "inter mode is not ok"); interMode.sa8dBits += totalmebits; @@ -2429,27 +2352,21 @@ } /* Check if using an alternative MVP would result in a smaller MVD + signal bits */ -void Search::checkBestMVP(MV* amvpCand, MV mv, MV& mvPred, int& outMvpIdx, uint32_t& outBits, uint32_t& outCost) const +const MV& Search::checkBestMVP(const MV* amvpCand, const MV& mv, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const { - X265_CHECK(amvpCand[outMvpIdx] == mvPred, "checkBestMVP: unexpected mvPred\n"); - - int mvpIdx = !outMvpIdx; - MV mvp = amvpCand[mvpIdx]; - int diffBits = m_me.bitcost(mv, mvp) - m_me.bitcost(mv, mvPred); + int diffBits = m_me.bitcost(mv, amvpCand[!mvpIdx]) - m_me.bitcost(mv, amvpCand[mvpIdx]); if (diffBits < 0) { - outMvpIdx = mvpIdx; - mvPred = mvp; + mvpIdx = !mvpIdx; uint32_t origOutBits = outBits; outBits = origOutBits + diffBits; outCost = (outCost - m_rdCost.getCost(origOutBits)) + m_rdCost.getCost(outBits); } + return amvpCand[mvpIdx]; } -void Search::setSearchRange(const CUData& cu, MV mvp, int merange, MV& mvmin, MV& mvmax) const +void Search::setSearchRange(const CUData& cu, const MV& mvp, int merange, MV& mvmin, MV& mvmax) const { - cu.clipMv(mvp); - MV dist((int16_t)merange << 2, (int16_t)merange << 2); mvmin = mvp - dist; mvmax = mvp + dist; @@ -2534,9 +2451,6 @@ uint32_t log2CUSize = cuGeom.log2CUSize; int sizeIdx = log2CUSize - 2; - uint32_t tqBypass = cu.m_tqBypass[0]; - m_quant.setQPforQuant(interMode.cu); - resiYuv->subtract(*fencYuv, *predYuv, log2CUSize); uint32_t tuDepthRange[2]; @@ -2547,6 +2461,7 @@ Cost costs; estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange); + uint32_t tqBypass = cu.m_tqBypass[0]; if (!tqBypass) { uint32_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size); @@ -2631,7 +2546,7 @@ interMode.coeffBits = coeffBits; interMode.mvBits = bits - coeffBits; updateModeCost(interMode); - checkDQP(interMode.cu, cuGeom); + checkDQP(interMode, cuGeom); } void Search::residualTransformQuantInter(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, const uint32_t depthRange[2]) @@ -3448,22 +3363,43 @@ } } -void 
Search::checkDQP(CUData& cu, const CUGeom& cuGeom) +void Search::checkDQP(Mode& mode, const CUGeom& cuGeom) { + CUData& cu = mode.cu; if (cu.m_slice->m_pps->bUseDQP && cuGeom.depth <= cu.m_slice->m_pps->maxCuDQPDepth) { if (cu.getQtRootCbf(0)) { - /* When analysing RDO with DQP bits, the entropy encoder should add the cost of DQP bits here - * i.e Encode QP */ + if (m_param->rdLevel >= 3) + { + mode.contexts.resetBits(); + mode.contexts.codeDeltaQP(cu, 0); + uint32_t bits = mode.contexts.getNumberOfWrittenBits(); + mode.mvBits += bits; + mode.totalBits += bits; + updateModeCost(mode); + } + else if (m_param->rdLevel <= 1) + { + mode.sa8dBits++; + mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits); + } + else + { + mode.mvBits++; + mode.totalBits++; + updateModeCost(mode); + } } else cu.setQPSubParts(cu.getRefQP(0), 0, cuGeom.depth); } } -void Search::checkDQPForSplitPred(CUData& cu, const CUGeom& cuGeom) +void Search::checkDQPForSplitPred(Mode& mode, const CUGeom& cuGeom) { + CUData& cu = mode.cu; + if ((cuGeom.depth == cu.m_slice->m_pps->maxCuDQPDepth) && cu.m_slice->m_pps->bUseDQP) { bool hasResidual = false; @@ -3478,10 +3414,31 @@ } } if (hasResidual) - /* TODO: Encode QP, and recalculate RD cost of splitPred */ + { + if (m_param->rdLevel >= 3) + { + mode.contexts.resetBits(); + mode.contexts.codeDeltaQP(cu, 0); + uint32_t bits = mode.contexts.getNumberOfWrittenBits(); + mode.mvBits += bits; + mode.totalBits += bits; + updateModeCost(mode); + } + else if (m_param->rdLevel <= 1) + { + mode.sa8dBits++; + mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits); + } + else + { + mode.mvBits++; + mode.totalBits++; + updateModeCost(mode); + } /* For all zero CBF sub-CUs, reset QP to RefQP (so that deltaQP is not signalled). When the non-zero CBF sub-CU is found, stop */ cu.setQPSubCUs(cu.getRefQP(0), 0, cuGeom.depth); + } else /* No residual within this CU or subCU, so reset QP to RefQP */ cu.setQPSubParts(cu.getRefQP(0), 0, cuGeom.depth);
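The MVP refactor above hoists duplicated candidate selection into selectMVP() and has checkBestMVP() return the winning candidate instead of mutating an out-parameter. A standalone sketch of the checkBestMVP() idea, with a crude stand-in for the MVD bit estimator (the real code uses m_me.bitcost() rate tables):

#include <cstdint>
#include <cstdlib>

struct MV { int16_t x, y; };

static uint32_t bitcost(const MV& mv, const MV& mvp) /* stand-in */
{
    return (uint32_t)(std::abs(mv.x - mvp.x) + std::abs(mv.y - mvp.y));
}

/* if coding the final MV against the other AMVP candidate is cheaper,
 * switch to it, fold the bit delta into the running total, and return
 * whichever candidate won */
const MV& checkBestMVP(const MV amvp[2], const MV& mv, int& mvpIdx, uint32_t& bits)
{
    int diff = (int)bitcost(mv, amvp[!mvpIdx]) - (int)bitcost(mv, amvp[mvpIdx]);
    if (diff < 0)
    {
        mvpIdx = !mvpIdx;
        bits = (uint32_t)((int)bits + diff);
    }
    return amvp[mvpIdx];
}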
x265_1.6.tar.gz/source/encoder/search.h -> x265_1.7.tar.gz/source/encoder/search.h
Changed
@@ -287,7 +287,7 @@ ~Search(); bool initSearch(const x265_param& param, ScalingList& scalingList); - void setQP(const Slice& slice, int qp); + int setLambdaFromQP(const CUData& ctu, int qp); /* returns real quant QP in valid spec range */ // mark temp RD entropy contexts as uninitialized; useful for finding loads without stores void invalidateContexts(int fromDepth); @@ -301,7 +301,7 @@ void encodeIntraInInter(Mode& intraMode, const CUGeom& cuGeom); // estimation inter prediction (non-skip) - void predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bMergeOnly, bool bChroma); + void predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC); // encode residual and compute rd-cost for inter mode void encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom); @@ -316,8 +316,8 @@ void getBestIntraModeChroma(Mode& intraMode, const CUGeom& cuGeom); /* update CBF flags and QP values to be internally consistent */ - void checkDQP(CUData& cu, const CUGeom& cuGeom); - void checkDQPForSplitPred(CUData& cu, const CUGeom& cuGeom); + void checkDQP(Mode& mode, const CUGeom& cuGeom); + void checkDQPForSplitPred(Mode& mode, const CUGeom& cuGeom); class PME : public BondedTaskGroup { @@ -339,7 +339,7 @@ }; void processPME(PME& pme, Search& slave); - void singleMotionEstimation(Search& master, Mode& interMode, const CUGeom& cuGeom, const PredictionUnit& pu, int part, int list, int ref); + void singleMotionEstimation(Search& master, Mode& interMode, const PredictionUnit& pu, int part, int list, int ref); protected: @@ -396,8 +396,9 @@ }; /* inter/ME helper functions */ - void checkBestMVP(MV* amvpCand, MV cMv, MV& mvPred, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const; - void setSearchRange(const CUData& cu, MV mvp, int merange, MV& mvmin, MV& mvmax) const; + int selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref); + const MV& checkBestMVP(const MV amvpCand[2], const MV& mv, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const; + void setSearchRange(const CUData& cu, const MV& mvp, int merange, MV& mvmin, MV& mvmax) const; uint32_t mergeEstimation(CUData& cu, const CUGeom& cuGeom, const PredictionUnit& pu, int puIdx, MergeData& m); static void getBlkBits(PartSize cuMode, bool bPSlice, int puIdx, uint32_t lastMode, uint32_t blockBit[3]);
x265_1.6.tar.gz/source/encoder/sei.h -> x265_1.7.tar.gz/source/encoder/sei.h
Changed
@@ -71,6 +71,8 @@ DECODED_PICTURE_HASH = 132, SCALABLE_NESTING = 133, REGION_REFRESH_INFO = 134, + MASTERING_DISPLAY_INFO = 137, + CONTENT_LIGHT_LEVEL_INFO = 144, }; virtual PayloadType payloadType() const = 0; @@ -111,6 +113,73 @@ } }; +class SEIMasteringDisplayColorVolume : public SEI +{ +public: + + uint16_t displayPrimaryX[3]; + uint16_t displayPrimaryY[3]; + uint16_t whitePointX, whitePointY; + uint32_t maxDisplayMasteringLuminance; + uint32_t minDisplayMasteringLuminance; + + PayloadType payloadType() const { return MASTERING_DISPLAY_INFO; } + + bool parse(const char* value) + { + return sscanf(value, "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)", + &displayPrimaryX[0], &displayPrimaryY[0], + &displayPrimaryX[1], &displayPrimaryY[1], + &displayPrimaryX[2], &displayPrimaryY[2], + &whitePointX, &whitePointY, + &maxDisplayMasteringLuminance, &minDisplayMasteringLuminance) == 10; + } + + void write(Bitstream& bs, const SPS&) + { + m_bitIf = &bs; + + WRITE_CODE(MASTERING_DISPLAY_INFO, 8, "payload_type"); + WRITE_CODE(8 * 2 + 2 * 4, 8, "payload_size"); + + for (uint32_t i = 0; i < 3; i++) + { + WRITE_CODE(displayPrimaryX[i], 16, "display_primaries_x[ c ]"); + WRITE_CODE(displayPrimaryY[i], 16, "display_primaries_y[ c ]"); + } + WRITE_CODE(whitePointX, 16, "white_point_x"); + WRITE_CODE(whitePointY, 16, "white_point_y"); + WRITE_CODE(maxDisplayMasteringLuminance, 32, "max_display_mastering_luminance"); + WRITE_CODE(minDisplayMasteringLuminance, 32, "min_display_mastering_luminance"); + } +}; + +class SEIContentLightLevel : public SEI +{ +public: + + uint16_t max_content_light_level; + uint16_t max_pic_average_light_level; + + PayloadType payloadType() const { return CONTENT_LIGHT_LEVEL_INFO; } + + bool parse(const char* value) + { + return sscanf(value, "%hu,%hu", + &max_content_light_level, &max_pic_average_light_level) == 2; + } + + void write(Bitstream& bs, const SPS&) + { + m_bitIf = &bs; + + WRITE_CODE(CONTENT_LIGHT_LEVEL_INFO, 8, "payload_type"); + WRITE_CODE(4, 8, "payload_size"); + WRITE_CODE(max_content_light_level, 16, "max_content_light_level"); + WRITE_CODE(max_pic_average_light_level, 16, "max_pic_average_light_level"); + } +}; + class SEIDecodedPictureHash : public SEI { public:
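The parse() formats above correspond one-to-one to the strings accepted by the new --master-display and --max-cll options. A usage sketch with illustrative HDR metadata (P3 primaries and a D65 white point in SMPTE 2086 units of 0.00002, luminance in units of 0.0001 cd/m2):

#include "sei.h" /* the header shown above */
using namespace x265;

bool parseHdrStrings()
{
    SEIMasteringDisplayColorVolume mdcv;
    bool ok = mdcv.parse("G(13250,34500)B(7500,3000)R(34000,16000)"
                         "WP(15635,16450)L(10000000,1)"); /* 1000-nit peak */

    SEIContentLightLevel cll;
    ok &= cll.parse("1000,400"); /* MaxCLL = 1000 nits, MaxFALL = 400 nits */
    return ok;
}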
x265_1.6.tar.gz/source/encoder/slicetype.cpp -> x265_1.7.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -44,23 +44,6 @@ namespace { -inline int16_t median(int16_t a, int16_t b, int16_t c) -{ - int16_t t = (a - b) & ((a - b) >> 31); - - a -= t; - b += t; - b -= (b - c) & ((b - c) >> 31); - b += (a - b) & ((a - b) >> 31); - return b; -} - -inline void median_mv(MV &dst, MV a, MV b, MV c) -{ - dst.x = median(a.x, b.x, c.x); - dst.y = median(a.y, b.y, c.y); -} - /* Compute variance to derive AC energy of each block */ inline uint32_t acEnergyVar(Frame *curFrame, uint64_t sum_ssd, int shift, int plane) { @@ -492,8 +475,6 @@ m_8x8Blocks = m_8x8Width > 2 && m_8x8Height > 2 ? (m_8x8Width - 2) * (m_8x8Height - 2) : m_8x8Width * m_8x8Height; m_lastKeyframe = -m_param->keyframeMax; - memset(m_preframes, 0, sizeof(m_preframes)); - m_preTotal = m_preAcquired = m_preCompleted = 0; m_sliceTypeBusy = false; m_fullQueueSize = X265_MAX(1, m_param->lookaheadDepth); m_bAdaptiveQuant = m_param->rc.aqMode || m_param->bEnableWeightedPred || m_param->bEnableWeightedBiPred; @@ -568,14 +549,14 @@ return m_tld && m_scratch; } -void Lookahead::stop() +void Lookahead::stopJobs() { if (m_pool && !m_inputQueue.empty()) { - m_preLookaheadLock.acquire(); + m_inputLock.acquire(); m_isActive = false; bool wait = m_outputSignalRequired = m_sliceTypeBusy; - m_preLookaheadLock.release(); + m_inputLock.release(); if (wait) m_outputSignal.wait(); @@ -634,19 +615,11 @@ m_filled = true; /* full capacity plus mini-gop lag */ } - m_preLookaheadLock.acquire(); - m_inputLock.acquire(); m_inputQueue.pushBack(curFrame); - m_inputLock.release(); - - m_preframes[m_preTotal++] = &curFrame; - X265_CHECK(m_preTotal <= X265_LOOKAHEAD_MAX, "prelookahead overflow\n"); - - m_preLookaheadLock.release(); - - if (m_pool) + if (m_pool && m_inputQueue.size() >= m_fullQueueSize) tryWakeOne(); + m_inputLock.release(); } /* Called by API thread */ @@ -657,74 +630,33 @@ m_filled = true; } -void Lookahead::findJob(int workerThreadID) +void Lookahead::findJob(int /*workerThreadID*/) { - Frame* preFrame; - bool doDecide; - - if (!m_isActive) - return; - - int tld = workerThreadID; - if (workerThreadID < 0) - tld = m_pool ? 
m_pool->m_numWorkers : 0; + bool doDecide; - m_preLookaheadLock.acquire(); - do - { - preFrame = NULL; - doDecide = false; + m_inputLock.acquire(); + if (m_inputQueue.size() >= m_fullQueueSize && !m_sliceTypeBusy && m_isActive) + doDecide = m_sliceTypeBusy = true; + else + doDecide = m_helpWanted = false; + m_inputLock.release(); - if (m_preTotal > m_preAcquired) - preFrame = m_preframes[m_preAcquired++]; - else - { - if (m_preTotal == m_preCompleted) - m_preAcquired = m_preTotal = m_preCompleted = 0; - - /* the worker thread that performs the last pre-lookahead will generally get to run - * slicetypeDecide() */ - m_inputLock.acquire(); - if (!m_sliceTypeBusy && !m_preTotal && m_inputQueue.size() >= m_fullQueueSize && m_isActive) - doDecide = m_sliceTypeBusy = true; - else - m_helpWanted = false; - m_inputLock.release(); - } - m_preLookaheadLock.release(); + if (!doDecide) + return; - if (preFrame) - { - ProfileLookaheadTime(m_preLookaheadElapsedTime, m_countPreLookahead); - ProfileScopeEvent(prelookahead); - - preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc); - if (m_param->rc.bStatRead && m_param->rc.cuTree && IS_REFERENCED(preFrame)) - /* cu-tree offsets were read from stats file */; - else if (m_bAdaptiveQuant) - m_tld[tld].calcAdaptiveQuantFrame(preFrame, m_param); - m_tld[tld].lowresIntraEstimate(preFrame->m_lowres); - - m_preLookaheadLock.acquire(); /* re-acquire for next pass */ - m_preCompleted++; - } - else if (doDecide) - { - ProfileLookaheadTime(m_slicetypeDecideElapsedTime, m_countSlicetypeDecide); - ProfileScopeEvent(slicetypeDecideEV); + ProfileLookaheadTime(m_slicetypeDecideElapsedTime, m_countSlicetypeDecide); + ProfileScopeEvent(slicetypeDecideEV); - slicetypeDecide(); + slicetypeDecide(); - m_preLookaheadLock.acquire(); /* re-acquire for next pass */ - if (m_outputSignalRequired) - { - m_outputSignal.trigger(); - m_outputSignalRequired = false; - } - m_sliceTypeBusy = false; - } + m_inputLock.acquire(); + if (m_outputSignalRequired) + { + m_outputSignal.trigger(); + m_outputSignalRequired = false; } - while (preFrame || doDecide); + m_sliceTypeBusy = false; + m_inputLock.release(); } /* Called by API thread */ @@ -739,13 +671,11 @@ if (out) return out; - /* process all pending pre-lookahead frames and run slicetypeDecide() if - * necessary */ - findJob(-1); + findJob(-1); /* run slicetypeDecide() if necessary */ - m_preLookaheadLock.acquire(); - bool wait = m_outputSignalRequired = m_sliceTypeBusy || m_preTotal; - m_preLookaheadLock.release(); + m_inputLock.acquire(); + bool wait = m_outputSignalRequired = m_sliceTypeBusy; + m_inputLock.release(); if (wait) m_outputSignal.wait(); @@ -809,7 +739,7 @@ { /* aggregate lowres row satds to CTU resolution */ curFrame->m_lowres.lowresCostForRc = curFrame->m_lowres.lowresCosts[b - p0][p1 - b]; - uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0; + uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0, intraSum = 0; uint32_t scale = m_param->maxCUSize / (2 * X265_LOWRES_CU_SIZE); uint32_t numCuInHeight = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize; uint32_t widthInLowresCu = (uint32_t)m_8x8Width, heightInLowresCu = (uint32_t)m_8x8Height; @@ -823,7 +753,7 @@ lowresRow = row * scale; for (uint32_t cnt = 0; cnt < scale && lowresRow < heightInLowresCu; lowresRow++, cnt++) { - sum = 0; + sum = 0; intraSum = 0; lowresCuIdx = lowresRow * widthInLowresCu; for (lowresCol = 0; lowresCol < widthInLowresCu; lowresCol++, lowresCuIdx++) { @@ -836,24 +766,57 @@ } 
curFrame->m_lowres.lowresCostForRc[lowresCuIdx] = lowresCuCost; sum += lowresCuCost; + intraSum += curFrame->m_lowres.intraCost[lowresCuIdx]; } curFrame->m_encData->m_rowStat[row].satdForVbv += sum; + curFrame->m_encData->m_rowStat[row].intraSatdForVbv += intraSum; } } } } +void PreLookaheadGroup::processTasks(int workerThreadID) +{ + if (workerThreadID < 0) + workerThreadID = m_lookahead.m_pool ? m_lookahead.m_pool->m_numWorkers : 0; + LookaheadTLD& tld = m_lookahead.m_tld[workerThreadID]; + + m_lock.acquire(); + while (m_jobAcquired < m_jobTotal) + { + Frame* preFrame = m_preframes[m_jobAcquired++]; + ProfileLookaheadTime(m_lookahead.m_preLookaheadElapsedTime, m_lookahead.m_countPreLookahead); + ProfileScopeEvent(prelookahead); + m_lock.release(); + + preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc); + if (m_lookahead.m_param->rc.bStatRead && m_lookahead.m_param->rc.cuTree && IS_REFERENCED(preFrame)) + /* cu-tree offsets were read from stats file */; + else if (m_lookahead.m_bAdaptiveQuant) + tld.calcAdaptiveQuantFrame(preFrame, m_lookahead.m_param); + tld.lowresIntraEstimate(preFrame->m_lowres); + preFrame->m_lowresInit = true; + + m_lock.acquire(); + } + m_lock.release(); +} + /* called by API thread or worker thread with inputQueueLock acquired */ void Lookahead::slicetypeDecide() { - Lowres *frames[X265_LOOKAHEAD_MAX]; - Frame *list[X265_LOOKAHEAD_MAX]; - int maxSearch = X265_MIN(m_param->lookaheadDepth, X265_LOOKAHEAD_MAX); + PreLookaheadGroup pre(*this); + Lowres* frames[X265_LOOKAHEAD_MAX + X265_BFRAME_MAX + 4]; + Frame* list[X265_BFRAME_MAX + 4]; memset(frames, 0, sizeof(frames)); memset(list, 0, sizeof(list)); + int maxSearch = X265_MIN(m_param->lookaheadDepth, X265_LOOKAHEAD_MAX); + maxSearch = X265_MAX(1, maxSearch); + { ScopedLock lock(m_inputLock); + Frame *curFrame = m_inputQueue.first(); int j; for (j = 0; j < m_param->bframes + 2; j++) @@ -869,13 +832,25 @@ { if (!curFrame) break; frames[j + 1] = &curFrame->m_lowres; - X265_CHECK(curFrame->m_lowres.costEst[0][0] > 0, "prelookahead not completed for input picture\n"); + + if (!curFrame->m_lowresInit) + pre.m_preframes[pre.m_jobTotal++] = curFrame; + curFrame = curFrame->m_next; } maxSearch = j; } + /* perform pre-analysis on frames which need it, using a bonded task group */ + if (pre.m_jobTotal) + { + if (m_pool) + pre.tryBondPeers(*m_pool, pre.m_jobTotal); + pre.processTasks(-1); + pre.waitForExit(); + } + if (m_lastNonB && !m_param->rc.bStatRead && ((m_param->bFrameAdaptive && m_param->bframes) || m_param->rc.cuTree || m_param->scenecutThreshold || @@ -2038,12 +2013,10 @@ int numc = 0; MV mvc[4], mvp; - MV* fencMV = &fenc->lowresMvs[i][listDist[i]][cuXY]; + ReferencePlanes* fref = i ? fref1 : wfref0; /* Reverse-order MV prediction */ - mvc[0] = 0; - mvc[2] = 0; #define MVC(mv) mvc[numc++] = mv; if (cuX < widthInCU - 1) MVC(fencMV[1]); @@ -2056,12 +2029,29 @@ MVC(fencMV[widthInCU + 1]); } #undef MVC - if (numc <= 1) - mvp = mvc[0]; + + if (!numc) + mvp = 0; else - median_mv(mvp, mvc[0], mvc[1], mvc[2]); + { + ALIGN_VAR_32(pixel, subpelbuf[X265_LOWRES_CU_SIZE * X265_LOWRES_CU_SIZE]); + int mvpcost = MotionEstimate::COST_MAX; + + /* measure SATD cost of each neighbor MV (estimating merge analysis) + * and use the lowest cost MV as MVP (estimating AMVP). 
Since all + * mvc[] candidates are measured here, none are passed to motionEstimate */ + for (int idx = 0; idx < numc; idx++) + { + intptr_t stride = X265_LOWRES_CU_SIZE; + pixel *src = fref->lowresMC(pelOffset, mvc[idx], subpelbuf, stride); + int cost = tld.me.bufSATD(src, stride); + COPY2_IF_LT(mvpcost, cost, mvp, mvc[idx]); + } + } - fencCost = tld.me.motionEstimate(i ? fref1 : wfref0, mvmin, mvmax, mvp, numc, mvc, s_merange, *fencMV); + /* ME will never return a cost larger than the cost @MVP, so we do not + * have to check that ME cost is more than the estimated merge cost */ + fencCost = tld.me.motionEstimate(fref, mvmin, mvmax, mvp, 0, NULL, s_merange, *fencMV); COPY2_IF_LT(bcost, fencCost, listused, i + 1); }
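PreLookaheadGroup::processTasks() above follows the bonded-task-group pattern: each bonded worker claims the next unprocessed frame under a lock, drops the lock for the heavy per-frame analysis, then re-acquires it to claim another. A minimal self-contained sketch of that loop shape, with illustrative names:

#include <mutex>
#include <vector>

struct PreAnalysisGroup
{
    std::mutex lock;
    std::vector<int> jobs; /* stand-in for Frame* m_preframes[] */
    size_t jobAcquired = 0;

    static void preAnalyze(int) { /* lowres init, AQ, intra estimate */ }

    void processTasks()
    {
        std::unique_lock<std::mutex> guard(lock);
        while (jobAcquired < jobs.size())
        {
            int frame = jobs[jobAcquired++];
            guard.unlock();    /* the expensive work runs unlocked */
            preAnalyze(frame);
            guard.lock();
        }
    }
};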
x265_1.6.tar.gz/source/encoder/slicetype.h -> x265_1.7.tar.gz/source/encoder/slicetype.h
Changed
@@ -105,8 +105,6 @@ Lock m_outputLock; /* pre-lookahead */ - Frame* m_preframes[X265_LOOKAHEAD_MAX]; - int m_preTotal, m_preAcquired, m_preCompleted; int m_fullQueueSize; bool m_isActive; bool m_sliceTypeBusy; @@ -114,7 +112,6 @@ bool m_outputSignalRequired; bool m_bBatchMotionSearch; bool m_bBatchFrameCosts; - Lock m_preLookaheadLock; Event m_outputSignal; LookaheadTLD* m_tld; @@ -143,7 +140,7 @@ bool create(); void destroy(); - void stop(); + void stopJobs(); void addPicture(Frame&, int sliceType); void flush(); @@ -176,6 +173,22 @@ int64_t frameCostRecalculate(Lowres **frames, int p0, int p1, int b); }; +class PreLookaheadGroup : public BondedTaskGroup +{ +public: + + Frame* m_preframes[X265_LOOKAHEAD_MAX]; + Lookahead& m_lookahead; + + PreLookaheadGroup(Lookahead& l) : m_lookahead(l) {} + + void processTasks(int workerThreadID); + +protected: + + PreLookaheadGroup& operator=(const PreLookaheadGroup&); +}; + class CostEstimateGroup : public BondedTaskGroup { public:
x265_1.6.tar.gz/source/input/input.cpp -> x265_1.7.tar.gz/source/input/input.cpp
Changed
@@ -27,7 +27,7 @@ using namespace x265; -Input* Input::open(InputFileInfo& info, bool bForceY4m) +InputFile* InputFile::open(InputFileInfo& info, bool bForceY4m) { const char * s = strrchr(info.filename, '.');
x265_1.6.tar.gz/source/input/input.h -> x265_1.7.tar.gz/source/input/input.h
Changed
@@ -48,23 +48,25 @@ int sarWidth; int sarHeight; int frameCount; + int timebaseNum; + int timebaseDenom; /* user supplied */ int skipFrames; const char *filename; }; -class Input +class InputFile { protected: - virtual ~Input() {} + virtual ~InputFile() {} public: - Input() {} + InputFile() {} - static Input* open(InputFileInfo& info, bool bForceY4m); + static InputFile* open(InputFileInfo& info, bool bForceY4m); virtual void startReader() = 0;
x265_1.6.tar.gz/source/input/y4m.cpp -> x265_1.7.tar.gz/source/input/y4m.cpp
Changed
@@ -46,9 +46,6 @@ for (int i = 0; i < QUEUE_SIZE; i++) buf[i] = NULL; - readCount.set(0); - writeCount.set(0); - threadActive = false; colorSpace = info.csp; sarWidth = info.sarWidth; @@ -164,7 +161,7 @@ void Y4MInput::release() { threadActive = false; - readCount.set(readCount.get()); // unblock file reader + readCount.poke(); stop(); delete this; } @@ -352,7 +349,7 @@ while (threadActive); threadActive = false; - writeCount.set(writeCount.get()); // unblock readPicture + writeCount.poke(); } bool Y4MInput::populateFrameQueue()
x265_1.6.tar.gz/source/input/y4m.h -> x265_1.7.tar.gz/source/input/y4m.h
Changed
@@ -33,7 +33,7 @@ namespace x265 { // x265 private namespace -class Y4MInput : public Input, public Thread +class Y4MInput : public InputFile, public Thread { protected:
x265_1.6.tar.gz/source/input/yuv.cpp -> x265_1.7.tar.gz/source/input/yuv.cpp
Changed
@@ -44,8 +44,6 @@ for (int i = 0; i < QUEUE_SIZE; i++) buf[i] = NULL; - readCount.set(0); - writeCount.set(0); depth = info.depth; width = info.width; height = info.height; @@ -152,7 +150,7 @@ void YUVInput::release() { threadActive = false; - readCount.set(readCount.get()); // unblock read thread + readCount.poke(); stop(); delete this; } @@ -175,7 +173,7 @@ } threadActive = false; - writeCount.set(writeCount.get()); // unblock readPicture + writeCount.poke(); } bool YUVInput::populateFrameQueue()
x265_1.6.tar.gz/source/input/yuv.h -> x265_1.7.tar.gz/source/input/yuv.h
Changed
@@ -33,7 +33,7 @@ namespace x265 { // private x265 namespace -class YUVInput : public Input, public Thread +class YUVInput : public InputFile, public Thread { protected:
x265_1.6.tar.gz/source/output/output.cpp -> x265_1.7.tar.gz/source/output/output.cpp
Changed
@@ -1,7 +1,8 @@ /***************************************************************************** - * Copyright (C) 2013 x265 project + * Copyright (C) 2013-2015 x265 project * * Authors: Steve Borho <steve@borho.org> + * Xinyue Lu <i@7086.in> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -25,9 +26,11 @@ #include "yuv.h" #include "y4m.h" +#include "raw.h" + using namespace x265; -Output* Output::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp) +ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp) { const char * s = strrchr(fname, '.'); @@ -36,3 +39,8 @@ else return new YUVOutput(fname, width, height, bitdepth, csp); } + +OutputFile* OutputFile::open(const char *fname, InputFileInfo& inputInfo) +{ + return new RAWOutput(fname, inputInfo); +}
x265_1.6.tar.gz/source/output/output.h -> x265_1.7.tar.gz/source/output/output.h
Changed
@@ -1,7 +1,8 @@ /***************************************************************************** - * Copyright (C) 2013 x265 project + * Copyright (C) 2013-2015 x265 project * * Authors: Steve Borho <steve@borho.org> + * Xinyue Lu <i@7086.in> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -25,22 +26,23 @@ #define X265_OUTPUT_H #include "x265.h" +#include "input/input.h" namespace x265 { // private x265 namespace -class Output +class ReconFile { protected: - virtual ~Output() {} + virtual ~ReconFile() {} public: - Output() {} + ReconFile() {} - static Output* open(const char *fname, int width, int height, uint32_t bitdepth, - uint32_t fpsNum, uint32_t fpsDenom, int csp); + static ReconFile* open(const char *fname, int width, int height, uint32_t bitdepth, + uint32_t fpsNum, uint32_t fpsDenom, int csp); virtual bool isFail() const = 0; @@ -50,6 +52,35 @@ virtual const char *getName() const = 0; }; + +class OutputFile +{ +protected: + + virtual ~OutputFile() {} + +public: + + OutputFile() {} + + static OutputFile* open(const char* fname, InputFileInfo& inputInfo); + + virtual bool isFail() const = 0; + + virtual bool needPTS() const = 0; + + virtual void release() = 0; + + virtual const char* getName() const = 0; + + virtual void setParam(x265_param* param) = 0; + + virtual int writeHeaders(const x265_nal* nal, uint32_t nalcount) = 0; + + virtual int writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture& pic) = 0; + + virtual void closeFile(int64_t largest_pts, int64_t second_largest_pts) = 0; +}; } #endif // ifndef X265_OUTPUT_H
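A sketch of how a caller drives the new OutputFile interface end to end, using only the virtuals declared above; the arguments are assumed to come from the encoder loop and error handling is elided:

#include "output.h" /* the interface shown above */
using namespace x265;

void muxStream(const char* fname, x265_param* param, InputFileInfo& info,
               const x265_nal* hdr, uint32_t hdrCount,
               const x265_nal* nal, uint32_t nalCount, x265_picture& pic,
               int64_t largestPts, int64_t secondLargestPts)
{
    OutputFile* out = OutputFile::open(fname, info); /* RAWOutput for now */
    if (out->isFail()) { out->release(); return; }

    out->setParam(param);                /* RAWOutput forces bAnnexB here */
    out->writeHeaders(hdr, hdrCount);
    out->writeFrame(nal, nalCount, pic); /* called once per encoded picture */
    out->closeFile(largestPts, secondLargestPts);
    out->release();
}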
x265_1.7.tar.gz/source/output/raw.cpp
Added
@@ -0,0 +1,80 @@ +/***************************************************************************** + * Copyright (C) 2013-2015 x265 project + * + * Authors: Steve Borho <steve@borho.org> + * Xinyue Lu <i@7086.in> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "raw.h" + +using namespace x265; +using namespace std; + +RAWOutput::RAWOutput(const char* fname, InputFileInfo&) +{ + b_fail = false; + if (!strcmp(fname, "-")) + { + ofs = &cout; + return; + } + ofs = new ofstream(fname, ios::binary | ios::out); + if (ofs->fail()) + b_fail = true; +} + +void RAWOutput::setParam(x265_param* param) +{ + param->bAnnexB = true; +} + +int RAWOutput::writeHeaders(const x265_nal* nal, uint32_t nalcount) +{ + uint32_t bytes = 0; + + for (uint32_t i = 0; i < nalcount; i++) + { + ofs->write((const char*)nal->payload, nal->sizeBytes); + bytes += nal->sizeBytes; + nal++; + } + + return bytes; +} + +int RAWOutput::writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture&) +{ + uint32_t bytes = 0; + + for (uint32_t i = 0; i < nalcount; i++) + { + ofs->write((const char*)nal->payload, nal->sizeBytes); + bytes += nal->sizeBytes; + nal++; + } + + return bytes; +} + +void RAWOutput::closeFile(int64_t, int64_t) +{ + if (ofs != &cout) + delete ofs; +}
x265_1.7.tar.gz/source/output/raw.h
Added
@@ -0,0 +1,64 @@ +/***************************************************************************** + * Copyright (C) 2013-2015 x265 project + * + * Authors: Steve Borho <steve@borho.org> + * Xinyue Lu <i@7086.in> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_HEVC_RAW_H +#define X265_HEVC_RAW_H + +#include "output.h" +#include "common.h" +#include <fstream> +#include <iostream> + +namespace x265 { +class RAWOutput : public OutputFile +{ +protected: + + std::ostream* ofs; + + bool b_fail; + +public: + + RAWOutput(const char* fname, InputFileInfo&); + + bool isFail() const { return b_fail; } + + bool needPTS() const { return false; } + + void release() { delete this; } + + const char* getName() const { return "raw"; } + + void setParam(x265_param* param); + + int writeHeaders(const x265_nal* nal, uint32_t nalcount); + + int writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture&); + + void closeFile(int64_t largest_pts, int64_t second_largest_pts); +}; +} + +#endif // ifndef X265_HEVC_RAW_H
x265_1.7.tar.gz/source/output/reconplay.cpp
Added
@@ -0,0 +1,197 @@ +/***************************************************************************** + * Copyright (C) 2013 x265 project + * + * Authors: Peixuan Zhang <zhangpeixuancn@gmail.com> + * Chunli Zhang <chunli@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "common.h" +#include "reconplay.h" + +#include <signal.h> + +using namespace x265; + +#if _WIN32 +#define popen _popen +#define pclose _pclose +#define pipemode "wb" +#else +#define pipemode "w" +#endif + +bool ReconPlay::pipeValid; + +#ifndef _WIN32 +static void sigpipe_handler(int) +{ + if (ReconPlay::pipeValid) + general_log(NULL, "exec", X265_LOG_ERROR, "pipe closed\n"); + ReconPlay::pipeValid = false; +} +#endif + +ReconPlay::ReconPlay(const char* commandLine, x265_param& param) +{ +#ifndef _WIN32 + if (signal(SIGPIPE, sigpipe_handler) == SIG_ERR) + general_log(¶m, "exec", X265_LOG_ERROR, "Unable to register SIGPIPE handler: %s\n", strerror(errno)); +#endif + + width = param.sourceWidth; + height = param.sourceHeight; + colorSpace = param.internalCsp; + + frameSize = 0; + for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++) + frameSize += (uint32_t)((width >> x265_cli_csps[colorSpace].width[i]) * (height >> x265_cli_csps[colorSpace].height[i])); + + for (int i = 0; i < RECON_BUF_SIZE; i++) + { + poc[i] = -1; + CHECKED_MALLOC(frameData[i], pixel, frameSize); + } + + outputPipe = popen(commandLine, pipemode); + if (outputPipe) + { + const char* csp = (colorSpace >= X265_CSP_I444) ? "444" : (colorSpace >= X265_CSP_I422) ? "422" : "420"; + const char* depth = (param.internalBitDepth == 10) ? 
"p10" : ""; + + fprintf(outputPipe, "YUV4MPEG2 W%d H%d F%d:%d Ip C%s%s\n", width, height, param.fpsNum, param.fpsDenom, csp, depth); + + pipeValid = true; + threadActive = true; + start(); + return; + } + else + general_log(¶m, "exec", X265_LOG_ERROR, "popen(%s) failed\n", commandLine); + +fail: + threadActive = false; +} + +ReconPlay::~ReconPlay() +{ + if (threadActive) + { + threadActive = false; + writeCount.poke(); + stop(); + } + + if (outputPipe) + pclose(outputPipe); + + for (int i = 0; i < RECON_BUF_SIZE; i++) + X265_FREE(frameData[i]); +} + +bool ReconPlay::writePicture(const x265_picture& pic) +{ + if (!threadActive || !pipeValid) + return false; + + int written = writeCount.get(); + int read = readCount.get(); + int currentCursor = pic.poc % RECON_BUF_SIZE; + + /* TODO: it's probably better to drop recon pictures when the ring buffer is + * backed up on the display app */ + while (written - read > RECON_BUF_SIZE - 2 || poc[currentCursor] != -1) + { + read = readCount.waitForChange(read); + if (!threadActive) + return false; + } + + X265_CHECK(pic.colorSpace == colorSpace, "invalid color space\n"); + X265_CHECK(pic.bitDepth == X265_DEPTH, "invalid bit depth\n"); + + pixel* buf = frameData[currentCursor]; + for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++) + { + char* src = (char*)pic.planes[i]; + int pwidth = width >> x265_cli_csps[colorSpace].width[i]; + + for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++) + { + memcpy(buf, src, pwidth * sizeof(pixel)); + src += pic.stride[i]; + buf += pwidth; + } + } + + poc[currentCursor] = pic.poc; + writeCount.incr(); + + return true; +} + +void ReconPlay::threadMain() +{ + THREAD_NAME("ReconPlayOutput", 0); + + do + { + /* extract the next output picture in display order and write to pipe */ + if (!outputFrame()) + break; + } + while (threadActive); + + threadActive = false; + readCount.poke(); +} + +bool ReconPlay::outputFrame() +{ + int written = writeCount.get(); + int read = readCount.get(); + int currentCursor = read % RECON_BUF_SIZE; + + while (poc[currentCursor] != read) + { + written = writeCount.waitForChange(written); + if (!threadActive) + return false; + } + + char* buf = (char*)frameData[currentCursor]; + intptr_t remainSize = frameSize * sizeof(pixel); + + fprintf(outputPipe, "FRAME\n"); + while (remainSize > 0) + { + intptr_t retCount = (intptr_t)fwrite(buf, sizeof(char), remainSize, outputPipe); + + if (retCount < 0 || !pipeValid) + /* pipe failure, stop writing and start dropping recon pictures */ + return false; + + buf += retCount; + remainSize -= retCount; + } + + poc[currentCursor] = -1; + readCount.incr(); + return true; +}
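ReconPlay is the engine behind the new --recon-y4m-exec preview: it popen()s the supplied command, writes a Y4M stream header, then feeds reconstructed frames in display order to the child's stdin. A typical invocation, assuming ffplay is installed (any player that accepts Y4M on stdin will do):

x265 input.y4m -o out.hevc --recon-y4m-exec "ffplay -i pipe:0 -autoexit"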
x265_1.7.tar.gz/source/output/reconplay.h
Added
@@ -0,0 +1,74 @@ +/***************************************************************************** + * Copyright (C) 2013 x265 project + * + * Authors: Peixuan Zhang <zhangpeixuancn@gmail.com> + * Chunli Zhang <chunli@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_RECONPLAY_H +#define X265_RECONPLAY_H + +#include "x265.h" +#include "threading.h" +#include <cstdio> + +namespace x265 { +// private x265 namespace + +class ReconPlay : public Thread +{ +public: + + ReconPlay(const char* commandLine, x265_param& param); + + virtual ~ReconPlay(); + + bool writePicture(const x265_picture& pic); + + static bool pipeValid; + +protected: + + enum { RECON_BUF_SIZE = 40 }; + + FILE* outputPipe; /* The output pipe for player */ + size_t frameSize; /* size of one frame in pixels */ + bool threadActive; /* worker thread is active */ + int width; /* width of frame */ + int height; /* height of frame */ + int colorSpace; /* color space of frame */ + + int poc[RECON_BUF_SIZE]; + pixel* frameData[RECON_BUF_SIZE]; + + /* Note that the class uses read and write counters to signal that reads and + * writes have occurred in the ring buffer, but writes into the buffer + * happen in decode order and the reader must check that the POC it next + * needs to send to the pipe is in fact present. The counters are used to + * prevent the writer from getting too far ahead of the reader */ + ThreadSafeInteger readCount; + ThreadSafeInteger writeCount; + + void threadMain(); + bool outputFrame(); +}; +} + +#endif // ifndef X265_RECONPLAY_H
x265_1.6.tar.gz/source/output/y4m.h -> x265_1.7.tar.gz/source/output/y4m.h
Changed
@@ -30,7 +30,7 @@ namespace x265 { // private x265 namespace -class Y4MOutput : public Output +class Y4MOutput : public ReconFile { protected:
x265_1.6.tar.gz/source/output/yuv.h -> x265_1.7.tar.gz/source/output/yuv.h
Changed
@@ -32,7 +32,7 @@ namespace x265 { // private x265 namespace -class YUVOutput : public Output +class YUVOutput : public ReconFile { protected:
x265_1.6.tar.gz/source/test/ipfilterharness.cpp -> x265_1.7.tar.gz/source/test/ipfilterharness.cpp
Changed
@@ -61,55 +61,6 @@ } } -bool IPFilterHarness::check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp) -{ - intptr_t rand_srcStride; - int min_size = isChroma ? 2 : 4; - int max_size = isChroma ? (MAX_CU_SIZE >> 1) : MAX_CU_SIZE; - - if (isChroma && (csp == X265_CSP_I444)) - { - min_size = 4; - max_size = MAX_CU_SIZE; - } - - for (int i = 0; i < ITERS; i++) - { - int index = i % TEST_CASES; - int rand_height = (int16_t)rand() % 100; - int rand_width = (int16_t)rand() % 100; - - rand_srcStride = rand_width + rand() % 100; - if (rand_srcStride < rand_width) - rand_srcStride = rand_width; - - rand_width &= ~(min_size - 1); - rand_width = x265_clip3(min_size, max_size, rand_width); - - rand_height &= ~(min_size - 1); - rand_height = x265_clip3(min_size, max_size, rand_height); - - ref(pixel_test_buff[index], - rand_srcStride, - IPF_C_output_s, - rand_width, - rand_height); - - checked(opt, pixel_test_buff[index], - rand_srcStride, - IPF_vec_output_s, - rand_width, - rand_height); - - if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t))) - return false; - - reportfail(); - } - - return true; -} - bool IPFilterHarness::check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt) { intptr_t rand_srcStride, rand_dstStride; @@ -518,12 +469,13 @@ { intptr_t rand_srcStride = rand() % 100; int index = i % TEST_CASES; + intptr_t dstStride = rand() % 100 + 64; - ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s); + ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s, dstStride); - checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s); + checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s, dstStride); - if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel))) + if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t))) return false; reportfail(); @@ -538,12 +490,13 @@ { intptr_t rand_srcStride = rand() % 100; int index = i % TEST_CASES; + intptr_t dstStride = rand() % 100 + 64; - ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s); + ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s, dstStride); - checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s); + checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s, dstStride); - if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel))) + if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t))) return false; reportfail(); @@ -554,15 +507,6 @@ bool IPFilterHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt) { - if (opt.luma_p2s) - { - // last parameter does not matter in case of luma - if (!check_IPFilter_primitive(ref.luma_p2s, opt.luma_p2s, 0, 1)) - { - printf("luma_p2s failed\n"); - return false; - } - } for (int value = 0; value < NUM_PU_SIZES; value++) { @@ -622,11 +566,11 @@ return false; } } - if (opt.pu[value].filter_p2s) + if (opt.pu[value].convert_p2s) { - if (!check_IPFilterLumaP2S_primitive(ref.pu[value].filter_p2s, opt.pu[value].filter_p2s)) + if (!check_IPFilterLumaP2S_primitive(ref.pu[value].convert_p2s, opt.pu[value].convert_p2s)) { - printf("filter_p2s[%s]", lumaPartStr[value]); + printf("convert_p2s[%s]", lumaPartStr[value]); return false; } } @@ -634,14 +578,6 @@ for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++) { - if (opt.chroma[csp].p2s) - { - if (!check_IPFilter_primitive(ref.chroma[csp].p2s, opt.chroma[csp].p2s, 1, csp)) - { - 
printf("chroma_p2s[%s]", x265_source_csp_names[csp]); - return false; - } - } for (int value = 0; value < NUM_PU_SIZES; value++) { if (opt.chroma[csp].pu[value].filter_hpp) @@ -692,9 +628,9 @@ return false; } } - if (opt.chroma[csp].pu[value].chroma_p2s) + if (opt.chroma[csp].pu[value].p2s) { - if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].chroma_p2s, opt.chroma[csp].pu[value].chroma_p2s)) + if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].p2s, opt.chroma[csp].pu[value].p2s)) { printf("chroma_p2s[%s]", chromaPartStr[csp][value]); return false; @@ -708,19 +644,10 @@ void IPFilterHarness::measureSpeed(const EncoderPrimitives& ref, const EncoderPrimitives& opt) { - int height = 64; - int width = 64; int16_t srcStride = 96; int16_t dstStride = 96; int maxVerticalfilterHalfDistance = 3; - if (opt.luma_p2s) - { - printf("luma_p2s\t"); - REPORT_SPEEDUP(opt.luma_p2s, ref.luma_p2s, - pixel_buff, srcStride, IPF_vec_output_s, width, height); - } - for (int value = 0; value < NUM_PU_SIZES; value++) { if (opt.pu[value].luma_hpp) @@ -777,23 +704,18 @@ pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3); } - if (opt.pu[value].filter_p2s) + if (opt.pu[value].convert_p2s) { - printf("filter_p2s [%s]\t", lumaPartStr[value]); - REPORT_SPEEDUP(opt.pu[value].filter_p2s, ref.pu[value].filter_p2s, - pixel_buff, srcStride, IPF_vec_output_s); + printf("convert_p2s[%s]\t", lumaPartStr[value]); + REPORT_SPEEDUP(opt.pu[value].convert_p2s, ref.pu[value].convert_p2s, + pixel_buff, srcStride, + IPF_vec_output_s, dstStride); } } for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++) { printf("= Color Space %s =\n", x265_source_csp_names[csp]); - if (opt.chroma[csp].p2s) - { - printf("chroma_p2s\t"); - REPORT_SPEEDUP(opt.chroma[csp].p2s, ref.chroma[csp].p2s, - pixel_buff, srcStride, IPF_vec_output_s, width, height); - } for (int value = 0; value < NUM_PU_SIZES; value++) { if (opt.chroma[csp].pu[value].filter_hpp) @@ -836,13 +758,11 @@ short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, IPF_vec_output_s, dstStride, 1); } - - if (opt.chroma[csp].pu[value].chroma_p2s) + if (opt.chroma[csp].pu[value].p2s) { printf("chroma_p2s[%s]\t", chromaPartStr[csp][value]); - REPORT_SPEEDUP(opt.chroma[csp].pu[value].chroma_p2s, ref.chroma[csp].pu[value].chroma_p2s, - pixel_buff, srcStride, - IPF_vec_output_s); + REPORT_SPEEDUP(opt.chroma[csp].pu[value].p2s, ref.chroma[csp].pu[value].p2s, + pixel_buff, srcStride, IPF_vec_output_s, dstStride); } } }
View file
x265_1.6.tar.gz/source/test/ipfilterharness.h -> x265_1.7.tar.gz/source/test/ipfilterharness.h
Changed
@@ -50,7 +50,6 @@
     pixel pixel_test_buff[TEST_CASES][TEST_BUF_SIZE];
     int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE];
 
-    bool check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp);
     bool check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt);
     bool check_IPFilterChroma_ps_primitive(filter_ps_t ref, filter_ps_t opt);
     bool check_IPFilterChroma_hps_primitive(filter_hps_t ref, filter_hps_t opt);
View file
x265_1.6.tar.gz/source/test/pixelharness.cpp -> x265_1.7.tar.gz/source/test/pixelharness.cpp
Changed
@@ -666,7 +666,32 @@ return true; } -bool PixelHarness::check_scale_pp(scale_t ref, scale_t opt) +bool PixelHarness::check_scale1D_pp(scale1D_t ref, scale1D_t opt) +{ + ALIGN_VAR_16(pixel, ref_dest[64 * 64]); + ALIGN_VAR_16(pixel, opt_dest[64 * 64]); + + memset(ref_dest, 0, sizeof(ref_dest)); + memset(opt_dest, 0, sizeof(opt_dest)); + + int j = 0; + for (int i = 0; i < ITERS; i++) + { + int index = i % TEST_CASES; + checked(opt, opt_dest, pixel_test_buff[index] + j); + ref(ref_dest, pixel_test_buff[index] + j); + + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) + return false; + + reportfail(); + j += INCR; + } + + return true; +} + +bool PixelHarness::check_scale2D_pp(scale2D_t ref, scale2D_t opt) { ALIGN_VAR_16(pixel, ref_dest[64 * 64]); ALIGN_VAR_16(pixel, opt_dest[64 * 64]); @@ -845,8 +870,8 @@ bool PixelHarness::check_calSign(sign_t ref, sign_t opt) { - ALIGN_VAR_16(int8_t, ref_dest[64 * 64]); - ALIGN_VAR_16(int8_t, opt_dest[64 * 64]); + ALIGN_VAR_16(int8_t, ref_dest[64 * 2]); + ALIGN_VAR_16(int8_t, opt_dest[64 * 2]); memset(ref_dest, 0xCD, sizeof(ref_dest)); memset(opt_dest, 0xCD, sizeof(opt_dest)); @@ -855,12 +880,12 @@ for (int i = 0; i < ITERS; i++) { - int width = 16 * (rand() % 4 + 1); + int width = (rand() % 64) + 1; ref(ref_dest, pbuf2 + j, pbuf3 + j, width); checked(opt, opt_dest, pbuf2 + j, pbuf3 + j, width); - if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int8_t))) + if (memcmp(ref_dest, opt_dest, sizeof(ref_dest))) return false; reportfail(); @@ -883,12 +908,10 @@ for (int i = 0; i < ITERS; i++) { int width = 16 * (rand() % 4 + 1); - int8_t sign = rand() % 3; - if (sign == 2) - sign = -1; + int stride = width + 1; - ref(ref_dest, psbuf1 + j, width, sign); - checked(opt, opt_dest, psbuf1 + j, width, sign); + ref(ref_dest, psbuf1 + j, width, psbuf2 + j, stride); + checked(opt, opt_dest, psbuf1 + j, width, psbuf5 + j, stride); if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) return false; @@ -928,7 +951,43 @@ return true; } -bool PixelHarness::check_saoCuOrgE2_t(saoCuOrgE2_t ref, saoCuOrgE2_t opt) +bool PixelHarness::check_saoCuOrgE2_t(saoCuOrgE2_t ref[2], saoCuOrgE2_t opt[2]) +{ + ALIGN_VAR_16(pixel, ref_dest[64 * 64]); + ALIGN_VAR_16(pixel, opt_dest[64 * 64]); + + memset(ref_dest, 0xCD, sizeof(ref_dest)); + memset(opt_dest, 0xCD, sizeof(opt_dest)); + + for (int id = 0; id < 2; id++) + { + int j = 0; + if (opt[id]) + { + for (int i = 0; i < ITERS; i++) + { + int width = 16 * (1 << (id * (rand() % 2 + 1))) - (rand() % 2); + int stride = width + 1; + + ref[width > 16](ref_dest, psbuf1 + j, psbuf2 + j, psbuf3 + j, width, stride); + checked(opt[width > 16], opt_dest, psbuf4 + j, psbuf2 + j, psbuf3 + j, width, stride); + + if (memcmp(psbuf1 + j, psbuf4 + j, width * sizeof(int8_t))) + return false; + + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) + return false; + + reportfail(); + j += INCR; + } + } + } + + return true; +} + +bool PixelHarness::check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt) { ALIGN_VAR_16(pixel, ref_dest[64 * 64]); ALIGN_VAR_16(pixel, opt_dest[64 * 64]); @@ -940,16 +999,14 @@ for (int i = 0; i < ITERS; i++) { - int width = 16 * (rand() % 4 + 1); - int stride = width + 1; - - ref(ref_dest, psbuf1 + j, psbuf2 + j, psbuf3 + j, width, stride); - checked(opt, opt_dest, psbuf4 + j, psbuf2 + j, psbuf3 + j, width, stride); + int stride = 16 * (rand() % 4 + 1); + int start = rand() % 2; + int end = 16 - rand() % 2; - if (memcmp(psbuf1 + j, psbuf4 + j, width * sizeof(int8_t))) - return false; + ref(ref_dest, psbuf2 + j, psbuf1 + j, 
stride, start, end); + checked(opt, opt_dest, psbuf5 + j, psbuf1 + j, stride, start, end); - if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)) || memcmp(psbuf2, psbuf5, BUFFSIZE)) return false; reportfail(); @@ -959,7 +1016,7 @@ return true; } -bool PixelHarness::check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt) +bool PixelHarness::check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt) { ALIGN_VAR_16(pixel, ref_dest[64 * 64]); ALIGN_VAR_16(pixel, opt_dest[64 * 64]); @@ -971,9 +1028,9 @@ for (int i = 0; i < ITERS; i++) { - int stride = 16 * (rand() % 4 + 1); + int stride = 32 * (rand() % 2 + 1); int start = rand() % 2; - int end = (16 * (rand() % 4 + 1)) - rand() % 2; + int end = (32 * (rand() % 2 + 1)) - rand() % 2; ref(ref_dest, psbuf2 + j, psbuf1 + j, stride, start, end); checked(opt, opt_dest, psbuf5 + j, psbuf1 + j, stride, start, end); @@ -995,9 +1052,8 @@ memset(ref_dest, 0xCD, sizeof(ref_dest)); memset(opt_dest, 0xCD, sizeof(opt_dest)); - - int width = 16 + rand() % 48; - int height = 16 + rand() % 48; + int width = 32 + rand() % 32; + int height = 32 + rand() % 32; intptr_t srcStride = 64; intptr_t dstStride = width; int j = 0; @@ -1133,8 +1189,8 @@ for (int i = 0; i < ITERS; i++) { int width = 16 * (rand() % 4 + 1); - int height = rand() % 64 +1; - int stride = rand() % 65; + int height = rand() % 63 + 2; + int stride = width; ref(ref_dest, psbuf1 + j, width, height, stride); checked(opt, opt_dest, psbuf1 + j, width, height, stride); @@ -1149,7 +1205,7 @@ return true; } -bool PixelHarness::check_findPosLast(findPosLast_t ref, findPosLast_t opt) +bool PixelHarness::check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt) { ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]); uint8_t ref_coeffNum[MLS_GRP_NUM], opt_coeffNum[MLS_GRP_NUM]; // value range[0, 16] @@ -1160,6 +1216,14 @@ for (int i = 0; i < 32 * 32; i++) { ref_src[i] = rand() & SHORT_MAX; + + // more zero coeff + if (ref_src[i] < SHORT_MAX * 2 / 3) + ref_src[i] = 0; + + // more negtive + if ((rand() % 10) < 8) + ref_src[i] *= -1; totalCoeffs += (ref_src[i] != 0); } @@ -1187,10 +1251,19 @@ for (int j = 0; j < 1 << (2 * (rand_scan_size + 2)); j++) rand_numCoeff += (ref_src[i + j] != 0); + // at least one coeff in transform block + if (rand_numCoeff == 0) + { + ref_src[i + (1 << (2 * (rand_scan_size + 2))) - 1] = -1; + rand_numCoeff = 1; + } + + const int trSize = (1 << (rand_scan_size + 2)); const uint16_t* const scanTbl = g_scanOrder[rand_scan_type][rand_scan_size]; + const uint16_t* const scanTblCG4x4 = g_scan4x4[rand_scan_size <= (MDCS_LOG2_MAX_SIZE - 2) ? 
rand_scan_type : SCAN_DIAG]; - int ref_scanPos = ref(scanTbl, ref_src + i, ref_coeffSign, ref_coeffFlag, ref_coeffNum, rand_numCoeff); - int opt_scanPos = (int)checked(opt, scanTbl, ref_src + i, opt_coeffSign, opt_coeffFlag, opt_coeffNum, rand_numCoeff); + int ref_scanPos = ref(scanTbl, ref_src + i, ref_coeffSign, ref_coeffFlag, ref_coeffNum, rand_numCoeff, scanTblCG4x4, trSize); + int opt_scanPos = (int)checked(opt, scanTbl, ref_src + i, opt_coeffSign, opt_coeffFlag, opt_coeffNum, rand_numCoeff, scanTblCG4x4, trSize); if (ref_scanPos != opt_scanPos) return false; @@ -1209,6 +1282,56 @@ rand_numCoeff -= ref_coeffNum[j]; } + if (rand_numCoeff != 0) + return false; + + reportfail(); + } + + return true; +} + +bool PixelHarness::check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt) +{ + ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]); + + for (int i = 0; i < 32 * 32; i++) + { + ref_src[i] = rand() & SHORT_MAX; + } + + // extra test area all of 0x1234 + for (int i = 0; i < ITERS * 2; i++) + { + ref_src[32 * 32 + i] = 0x1234; + } + + for (int i = 0; i < ITERS; i++) + { + int rand_scan_type = rand() % NUM_SCAN_TYPE; + int rand_scan_size = (rand() % NUM_SCAN_SIZE) + 2; + coeff_t *rand_src = ref_src + i; + + const uint16_t* const scanTbl = g_scan4x4[rand_scan_type]; + + int j; + for (j = 0; j < SCAN_SET_SIZE; j++) + { + const uint32_t idxY = j / MLS_CG_SIZE; + const uint32_t idxX = j % MLS_CG_SIZE; + if (rand_src[idxY * rand_scan_size + idxX]) break; + } + + // fill one coeff when all coeff group are zero + if (j >= SCAN_SET_SIZE) + rand_src[0] = 0x0BAD; + + uint32_t ref_scanPos = ref(rand_src, (1 << rand_scan_size), scanTbl); + uint32_t opt_scanPos = (int)checked(opt, rand_src, (1 << rand_scan_size), scanTbl); + + if (ref_scanPos != opt_scanPos) + return false; + reportfail(); } @@ -1414,6 +1537,14 @@ return false; } } + if (opt.chroma[i].cu[part].sa8d) + { + if (!check_pixelcmp(ref.chroma[i].cu[part].sa8d, opt.chroma[i].cu[part].sa8d)) + { + printf("chroma_sa8d[%s][%s] failed\n", x265_source_csp_names[i], chromaPartStr[i][part]); + return false; + } + } } } @@ -1603,7 +1734,7 @@ if (opt.scale1D_128to64) { - if (!check_scale_pp(ref.scale1D_128to64, opt.scale1D_128to64)) + if (!check_scale1D_pp(ref.scale1D_128to64, opt.scale1D_128to64)) { printf("scale1D_128to64 failed!\n"); return false; @@ -1612,7 +1743,7 @@ if (opt.scale2D_64to32) { - if (!check_scale_pp(ref.scale2D_64to32, opt.scale2D_64to32)) + if (!check_scale2D_pp(ref.scale2D_64to32, opt.scale2D_64to32)) { printf("scale2D_64to32 failed!\n"); return false; @@ -1664,20 +1795,41 @@ } } - if (opt.saoCuOrgE2) + if (opt.saoCuOrgE1_2Rows) + { + if (!check_saoCuOrgE1_t(ref.saoCuOrgE1_2Rows, opt.saoCuOrgE1_2Rows)) + { + printf("SAO_EO_1_2Rows failed\n"); + return false; + } + } + + if (opt.saoCuOrgE2[0] || opt.saoCuOrgE2[1]) + { + saoCuOrgE2_t ref1[] = { ref.saoCuOrgE2[0], ref.saoCuOrgE2[1] }; + saoCuOrgE2_t opt1[] = { opt.saoCuOrgE2[0], opt.saoCuOrgE2[1] }; + + if (!check_saoCuOrgE2_t(ref1, opt1)) + { + printf("SAO_EO_2[0] && SAO_EO_2[1] failed\n"); + return false; + } + } + + if (opt.saoCuOrgE3[0]) { - if (!check_saoCuOrgE2_t(ref.saoCuOrgE2, opt.saoCuOrgE2)) + if (!check_saoCuOrgE3_t(ref.saoCuOrgE3[0], opt.saoCuOrgE3[0])) { - printf("SAO_EO_2 failed\n"); + printf("SAO_EO_3[0] failed\n"); return false; } } - if (opt.saoCuOrgE3) + if (opt.saoCuOrgE3[1]) { - if (!check_saoCuOrgE3_t(ref.saoCuOrgE3, opt.saoCuOrgE3)) + if (!check_saoCuOrgE3_32_t(ref.saoCuOrgE3[1], opt.saoCuOrgE3[1])) { - printf("SAO_EO_3 failed\n"); + 
printf("SAO_EO_3[1] failed\n"); return false; } } @@ -1718,11 +1870,20 @@ } } - if (opt.findPosLast) + if (opt.scanPosLast) { - if (!check_findPosLast(ref.findPosLast, opt.findPosLast)) + if (!check_scanPosLast(ref.scanPosLast, opt.scanPosLast)) { - printf("findPosLast failed!\n"); + printf("scanPosLast failed!\n"); + return false; + } + } + + if (opt.findPosFirstLast) + { + if (!check_findPosFirstLast(ref.findPosFirstLast, opt.findPosFirstLast)) + { + printf("findPosFirstLast failed!\n"); return false; } } @@ -1863,6 +2024,11 @@ HEADER("[%s] add_ps[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); REPORT_SPEEDUP(opt.chroma[i].cu[part].add_ps, ref.chroma[i].cu[part].add_ps, pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE); } + if (opt.chroma[i].cu[part].sa8d) + { + HEADER("[%s] sa8d[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); + REPORT_SPEEDUP(opt.chroma[i].cu[part].sa8d, ref.chroma[i].cu[part].sa8d, pbuf1, STRIDE, pbuf2, STRIDE); + } } } @@ -2003,7 +2169,7 @@ if (opt.scale1D_128to64) { HEADER0("scale1D_128to64"); - REPORT_SPEEDUP(opt.scale1D_128to64, ref.scale1D_128to64, pbuf2, pbuf1, 64); + REPORT_SPEEDUP(opt.scale1D_128to64, ref.scale1D_128to64, pbuf2, pbuf1); } if (opt.scale2D_64to32) @@ -2033,7 +2199,7 @@ if (opt.saoCuOrgE0) { HEADER0("SAO_EO_0"); - REPORT_SPEEDUP(opt.saoCuOrgE0, ref.saoCuOrgE0, pbuf1, psbuf1, 64, 1); + REPORT_SPEEDUP(opt.saoCuOrgE0, ref.saoCuOrgE0, pbuf1, psbuf1, 64, psbuf2, 64); } if (opt.saoCuOrgE1) @@ -2042,16 +2208,34 @@ REPORT_SPEEDUP(opt.saoCuOrgE1, ref.saoCuOrgE1, pbuf1, psbuf2, psbuf1, 64, 64); } - if (opt.saoCuOrgE2) + if (opt.saoCuOrgE1_2Rows) { - HEADER0("SAO_EO_2"); - REPORT_SPEEDUP(opt.saoCuOrgE2, ref.saoCuOrgE2, pbuf1, psbuf1, psbuf2, psbuf3, 64, 64); + HEADER0("SAO_EO_1_2Rows"); + REPORT_SPEEDUP(opt.saoCuOrgE1_2Rows, ref.saoCuOrgE1_2Rows, pbuf1, psbuf2, psbuf1, 64, 64); } - if (opt.saoCuOrgE3) + if (opt.saoCuOrgE2[0]) { - HEADER0("SAO_EO_3"); - REPORT_SPEEDUP(opt.saoCuOrgE3, ref.saoCuOrgE3, pbuf1, psbuf2, psbuf1, 64, 0, 64); + HEADER0("SAO_EO_2[0]"); + REPORT_SPEEDUP(opt.saoCuOrgE2[0], ref.saoCuOrgE2[0], pbuf1, psbuf1, psbuf2, psbuf3, 16, 64); + } + + if (opt.saoCuOrgE2[1]) + { + HEADER0("SAO_EO_2[1]"); + REPORT_SPEEDUP(opt.saoCuOrgE2[1], ref.saoCuOrgE2[1], pbuf1, psbuf1, psbuf2, psbuf3, 64, 64); + } + + if (opt.saoCuOrgE3[0]) + { + HEADER0("SAO_EO_3[0]"); + REPORT_SPEEDUP(opt.saoCuOrgE3[0], ref.saoCuOrgE3[0], pbuf1, psbuf2, psbuf1, 64, 0, 16); + } + + if (opt.saoCuOrgE3[1]) + { + HEADER0("SAO_EO_3[1]"); + REPORT_SPEEDUP(opt.saoCuOrgE3[1], ref.saoCuOrgE3[1], pbuf1, psbuf2, psbuf1, 64, 0, 64); } if (opt.saoCuOrgB0) @@ -2078,12 +2262,25 @@ REPORT_SPEEDUP(opt.propagateCost, ref.propagateCost, ibuf1, ushort_test_buff[0], int_test_buff[0], ushort_test_buff[0], int_test_buff[0], double_test_buff[0], 80); } - if (opt.findPosLast) + if (opt.scanPosLast) { - HEADER0("findPosLast"); + HEADER0("scanPosLast"); coeff_t coefBuf[32 * 32]; memset(coefBuf, 0, sizeof(coefBuf)); memset(coefBuf + 32 * 31, 1, 32 * sizeof(coeff_t)); - REPORT_SPEEDUP(opt.findPosLast, ref.findPosLast, g_scanOrder[SCAN_DIAG][NUM_SCAN_SIZE - 1], coefBuf, (uint16_t*)sbuf1, (uint16_t*)sbuf2, (uint8_t*)psbuf1, 32); + REPORT_SPEEDUP(opt.scanPosLast, ref.scanPosLast, g_scanOrder[SCAN_DIAG][NUM_SCAN_SIZE - 1], coefBuf, (uint16_t*)sbuf1, (uint16_t*)sbuf2, (uint8_t*)psbuf1, 32, g_scan4x4[SCAN_DIAG], 32); + } + + if (opt.findPosFirstLast) + { + HEADER0("findPosFirstLast"); + coeff_t coefBuf[32 * MLS_CG_SIZE]; + memset(coefBuf, 0, sizeof(coefBuf)); + // every CG can't be all zeros! 
+ coefBuf[3 + 0 * 32] = 0x0BAD; + coefBuf[3 + 1 * 32] = 0x0BAD; + coefBuf[3 + 2 * 32] = 0x0BAD; + coefBuf[3 + 3 * 32] = 0x0BAD; + REPORT_SPEEDUP(opt.findPosFirstLast, ref.findPosFirstLast, coefBuf, 32, g_scan4x4[SCAN_DIAG]); } }
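
A notable pattern in the hunks above: the SAO edge-offset primitives saoCuOrgE2 and saoCuOrgE3 become two-entry tables, index 0 for widths up to 16 and index 1 for wider blocks, selected exactly as the harness does with ref[width > 16](...). A minimal sketch of that dispatch; the typedef mirrors the harness call sites but is our assumption, with uint8_t standing in for the build-dependent pixel type:

#include <cstdint>

typedef void (*saoCuOrgE2_t)(uint8_t* rec, int8_t* bufft, int8_t* buff1,
                             int8_t* signUp, int width, intptr_t stride); // assumed

static void applySaoE2(const saoCuOrgE2_t fn[2], uint8_t* rec, int8_t* bufft,
                       int8_t* buff1, int8_t* signUp, int width, intptr_t stride)
{
    fn[width > 16](rec, bufft, buff1, signUp, width, stride); // [0]: w<=16, [1]: w>16
}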
View file
x265_1.6.tar.gz/source/test/pixelharness.h -> x265_1.7.tar.gz/source/test/pixelharness.h
Changed
@@ -76,7 +76,8 @@
     bool check_pixelavg_pp(pixelavg_pp_t ref, pixelavg_pp_t opt);
     bool check_pixel_sub_ps(pixel_sub_ps_t ref, pixel_sub_ps_t opt);
     bool check_pixel_add_ps(pixel_add_ps_t ref, pixel_add_ps_t opt);
-    bool check_scale_pp(scale_t ref, scale_t opt);
+    bool check_scale1D_pp(scale1D_t ref, scale1D_t opt);
+    bool check_scale2D_pp(scale2D_t ref, scale2D_t opt);
     bool check_ssd_s(pixel_ssd_s_t ref, pixel_ssd_s_t opt);
     bool check_blockfill_s(blockfill_s_t ref, blockfill_s_t opt);
     bool check_calresidual(calcresidual_t ref, calcresidual_t opt);
@@ -95,8 +96,9 @@
     bool check_addAvg(addAvg_t, addAvg_t);
     bool check_saoCuOrgE0_t(saoCuOrgE0_t ref, saoCuOrgE0_t opt);
     bool check_saoCuOrgE1_t(saoCuOrgE1_t ref, saoCuOrgE1_t opt);
-    bool check_saoCuOrgE2_t(saoCuOrgE2_t ref, saoCuOrgE2_t opt);
+    bool check_saoCuOrgE2_t(saoCuOrgE2_t ref[], saoCuOrgE2_t opt[]);
     bool check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
+    bool check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
     bool check_saoCuOrgB0_t(saoCuOrgB0_t ref, saoCuOrgB0_t opt);
     bool check_planecopy_sp(planecopy_sp_t ref, planecopy_sp_t opt);
     bool check_planecopy_cp(planecopy_cp_t ref, planecopy_cp_t opt);
@@ -104,7 +106,8 @@
     bool check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt);
     bool check_psyCost_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt);
     bool check_calSign(sign_t ref, sign_t opt);
-    bool check_findPosLast(findPosLast_t ref, findPosLast_t opt);
+    bool check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt);
+    bool check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt);
 
 public:
View file
x265_1.6.tar.gz/source/test/rate-control-tests.txt -> x265_1.7.tar.gz/source/test/rate-control-tests.txt
Changed
@@ -1,34 +1,36 @@
-# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
-
-# This test is listed first since it currently reproduces bugs
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
-
-# VBV tests, non-deterministic so testing for correctness and bitrate
-# fluctuations - up to 1% bitrate fluctuation is allowed between runs
-RaceHorses_416x240_30_10bit.yuv,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --vbv-bufsize 600 --vbv-maxrate 600
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --bitrate 1100 --vbv-bufsize 1100 --vbv-maxrate 1200
-112_1920x1080_25.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud
-112_1920x1080_25.yuv,--preset medium --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd
-112_1920x1080_25.yuv,--preset medium --bitrate 4000 --vbv-maxrate 12000 --vbv-bufsize 12000 --repeat-headers
-112_1920x1080_25.yuv,--preset superfast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1500 --hrd --strict-cbr
-112_1920x1080_25.yuv,--preset superfast --bitrate 30000 --vbv-maxrate 30000 --vbv-bufsize 30000 --repeat-headers
-112_1920x1080_25.yuv,--preset superfast --bitrate 4000 --vbv-maxrate 6000 --vbv-bufsize 6000 --aud
-112_1920x1080_25.yuv,--preset veryslow --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --vbv-bufsize 3000 --vbv-maxrate 3000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud
-big_buck_bunny_360p24.y4m,--preset medium --crf 1 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 1000 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud --strict-cbr
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 3000 --vbv-bufsize 9000 --vbv-maxrate 9000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd
-big_buck_bunny_360p24.y4m,--preset superfast --crf 6 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud
-
-# multi-pass rate control tests
-big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1,--preset slow --bitrate 200 --pass 2
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4
-112_1920x1080_25.yuv,--preset slow --bitrate 1000 --pass 1 -F4,--preset slow --bitrate 1000 --pass 2 -F4
-112_1920x1080_25.yuv,--preset superfast --crf 12 --pass 1,--preset superfast --bitrate 4000 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --crf 40 --pass 1, --preset veryslow --bitrate 200 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 600 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --pass 1,--preset medium --bitrate 500 --pass 3 -F4,--preset medium --bitrate 500 --pass 2 -F4
+# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
+
+# These tests should yield deterministic results
+# This test is listed first since it currently reproduces bugs
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
+fire_1920x1080_30.yuv, --preset slow --bitrate 2000 --tune zero-latency
+
+
+# VBV tests, non-deterministic so testing for correctness and bitrate
+# fluctuations - up to 1% bitrate fluctuation is allowed between runs
+night_cars_1920x1080_30.yuv,--preset medium --crf 25 --vbv-bufsize 5000 --vbv-maxrate 5000 -F6 --crf-max 34 --crf-min 22
+ducks_take_off_420_720p50.y4m,--preset slow --bitrate 1600 --vbv-bufsize 1600 --vbv-maxrate 1600 --strict-cbr --aq-mode 2 --aq-strength 0.5
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryslow --bitrate 4000 --vbv-bufsize 3000 --vbv-maxrate 4000 --tune grain
+fire_1920x1080_30.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud --pmode --tune ssim
+112_1920x1080_25.yuv,--preset ultrafast --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd --strict-cbr
+Traffic_4096x2048_30.yuv,--preset superfast --bitrate 20000 --vbv-maxrate 20000 --vbv-bufsize 20000 --repeat-headers --strict-cbr
+Traffic_4096x2048_30.yuv,--preset faster --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 6000 --aud --repeat-headers --no-open-gop --hrd --pmode --pme
+News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 18000 --vbv-bufsize 20000 --vbv-maxrate 18000 --strict-cbr
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 8000 --vbv-bufsize 12000 --vbv-maxrate 10000 --tune grain
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode
+sita_1920x1080_30.yuv,--preset superfast --crf 25 --vbv-bufsize 3000 --vbv-maxrate 4000 --vbv-bufsize 5000 --hrd --crf-max 30
+sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr
+
+
+
+# multi-pass rate control tests
+big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1 -f 5000,--preset slow --bitrate 200 --pass 2 -f 5000
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass -f 5000 ,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4 -f 5000
+112_1920x1080_25.yuv,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1000 --strict-cbr --pass 1 -F4,--preset fast --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --pass 2 -F4
+pine_tree_1920x1080_30.yuv,--preset veryfast --crf 12 --pass 1 -F4,--preset faster --bitrate 4000 --pass 2 -F4
+SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv, --tune grain --preset ultrafast --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 8000 --strict-cbr -F4 --pass 1, --tune grain --preset ultrafast --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 8000 -F4 --pass 2
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 40 --pass 1, --preset faster --bitrate 200 --pass 2 -F4
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --bitrate 2500 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 2500 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --vbv-maxrate 1000 --vbv-bufsize 1000 --pass 1,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 700 --pass 3 -F4,--preset slow --bitrate 500 --vbv-maxrate 500 --vbv-bufsize 700 --pass 2 -F4
+
View file
x265_1.6.tar.gz/source/test/regression-tests.txt -> x265_1.7.tar.gz/source/test/regression-tests.txt
Changed
@@ -12,9 +12,9 @@
 # not auto-detected.
 
 BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
-BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 32
 BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
-BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3
+BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16
 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
 BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp
 BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
@@ -29,7 +29,7 @@
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slow --no-wpp --tune ssim --transfer smpte240m
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
-CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
@@ -37,8 +37,8 @@
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
@@ -51,11 +51,11 @@
 Kimono1_1920x1080_24_10bit_444.yuv,--preset superfast --weightb
 KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
 KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8
-KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16
+KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16
 KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
-News-4k.y4m,--preset medium --tune ssim --no-sao
+News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 32
 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
@@ -108,13 +108,13 @@
 parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain
 silent_cif_420.y4m,--preset medium --me full --rect --amp
 silent_cif_420.y4m,--preset superfast --weightp --rect
-silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao
+silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao --qg-size 16
 vtc1nw_422_ntsc.y4m,--preset medium --scaling-list default --ctu 16 --ref 5
-vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode
+vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16
 vtc1nw_422_ntsc.y4m,--preset superfast --weightp --nr-intra 100 -F4
 washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5
 washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4
-washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2
+washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32
 washdc_422_ntsc.y4m,--preset superfast --psy-rd 1 --tune zerolatency
 washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4
 washdc_422_ntsc.y4m,--preset veryfast --tu-inter-depth 4
View file
x265_1.6.tar.gz/source/test/smoke-tests.txt -> x265_1.7.tar.gz/source/test/smoke-tests.txt
Changed
@@ -1,14 +1,18 @@
 # List of command lines to be run by smoke tests, see https://bitbucket.org/sborho/test-harness
 
+# consider VBV tests a failure if new bitrate is more than 5% different
+# from the old bitrate
+# vbv-tolerance = 0.05
+
 big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
 big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
-big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme
-washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme --qg-size 16
+washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1 --qg-size 16
 washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
 washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
 old_town_cross_444_720p50.y4m,--preset=ultrafast --weightp --keyint -1
 old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16
-old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode
+old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode --qg-size 32
 RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --cu-stats --max-tu-size 8
 RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
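
The new vbv-tolerance comment makes the pass/fail rule explicit: a VBV test fails when the measured bitrate drifts from the recorded result by more than 5% here, or 1% in the rate-control list. A sketch of that rule as a predicate; the harness's actual bookkeeping is not shown in this diff:

#include <cmath>

/* true if newKbps is within tol (0.05 for smoke tests, 0.01 for
 * rate-control tests) of the recorded oldKbps */
static bool bitrateWithinTolerance(double oldKbps, double newKbps, double tol)
{
    return std::fabs(newKbps - oldKbps) <= tol * oldKbps;
}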
View file
x265_1.6.tar.gz/source/test/testbench.cpp -> x265_1.7.tar.gz/source/test/testbench.cpp
Changed
@@ -168,6 +168,7 @@
     { "AVX", X265_CPU_AVX },
     { "XOP", X265_CPU_XOP },
     { "AVX2", X265_CPU_AVX2 },
+    { "BMI2", X265_CPU_AVX2 | X265_CPU_BMI1 | X265_CPU_BMI2 },
     { "", 0 },
 };
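
The added row ties the BMI2 kernels to AVX2-class hardware by OR-ing three capability bits into one mask. A sketch of how such a name-to-mask table can be scanned; only the table shape comes from testbench.cpp, the lookup helper itself is an assumption:

#include <cstring>
#include <cstdint>

struct CpuName { const char* name; uint32_t flags; };

static uint32_t parseCpuName(const CpuName* tbl, const char* arg)
{
    for (int i = 0; tbl[i].name[0]; i++)   // table ends with { "", 0 }
        if (!strcmp(tbl[i].name, arg))
            return tbl[i].flags;           // e.g. "BMI2" => AVX2|BMI1|BMI2
    return 0;
}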
View file
x265_1.6.tar.gz/source/x265.cpp -> x265_1.7.tar.gz/source/x265.cpp
Changed
@@ -27,6 +27,7 @@ #include "input/input.h" #include "output/output.h" +#include "output/reconplay.h" #include "filters/filters.h" #include "common.h" #include "param.h" @@ -46,12 +47,16 @@ #include <string> #include <ostream> #include <fstream> +#include <queue> +#define CONSOLE_TITLE_SIZE 200 #ifdef _WIN32 #include <windows.h> +static char orgConsoleTitle[CONSOLE_TITLE_SIZE] = ""; #else #define GetConsoleTitle(t, n) #define SetConsoleTitle(t) +#define SetThreadExecutionState(es) #endif using namespace x265; @@ -65,33 +70,34 @@ struct CLIOptions { - Input* input; - Output* recon; - std::fstream bitstreamFile; + InputFile* input; + ReconFile* recon; + OutputFile* output; + FILE* qpfile; + const char* reconPlayCmd; + const x265_api* api; + x265_param* param; bool bProgress; bool bForceY4m; bool bDither; - uint32_t seek; // number of frames to skip from the beginning uint32_t framesToBeEncoded; // number of frames to encode uint64_t totalbytes; - size_t analysisRecordSize; // number of bytes read from or dumped into file - int analysisHeaderSize; - int64_t startTime; int64_t prevUpdateTime; - float frameRate; - FILE* qpfile; - FILE* analysisFile; /* in microseconds */ static const int UPDATE_INTERVAL = 250000; CLIOptions() { - frameRate = 0.f; input = NULL; recon = NULL; + output = NULL; + qpfile = NULL; + reconPlayCmd = NULL; + api = NULL; + param = NULL; framesToBeEncoded = seek = 0; totalbytes = 0; bProgress = true; @@ -99,18 +105,12 @@ startTime = x265_mdate(); prevUpdateTime = 0; bDither = false; - qpfile = NULL; - analysisFile = NULL; - analysisRecordSize = 0; - analysisHeaderSize = 0; } void destroy(); - void writeNALs(const x265_nal* nal, uint32_t nalcount); - void printStatus(uint32_t frameNum, x265_param *param); - bool parse(int argc, char **argv, x265_param* param); + void printStatus(uint32_t frameNum); + bool parse(int argc, char **argv); bool parseQPFile(x265_picture &pic_org); - bool validateFanout(x265_param*); }; void CLIOptions::destroy() @@ -124,23 +124,12 @@ if (qpfile) fclose(qpfile); qpfile = NULL; - if (analysisFile) - fclose(analysisFile); - analysisFile = NULL; + if (output) + output->release(); + output = NULL; } -void CLIOptions::writeNALs(const x265_nal* nal, uint32_t nalcount) -{ - ProfileScopeEvent(bitstreamWrite); - for (uint32_t i = 0; i < nalcount; i++) - { - bitstreamFile.write((const char*)nal->payload, nal->sizeBytes); - totalbytes += nal->sizeBytes; - nal++; - } -} - -void CLIOptions::printStatus(uint32_t frameNum, x265_param *param) +void CLIOptions::printStatus(uint32_t frameNum) { char buf[200]; int64_t time = x265_mdate(); @@ -167,15 +156,16 @@ prevUpdateTime = time; } -bool CLIOptions::parse(int argc, char **argv, x265_param* param) +bool CLIOptions::parse(int argc, char **argv) { bool bError = 0; int help = 0; int inputBitDepth = 8; + int outputBitDepth = 0; int reconFileBitDepth = 0; const char *inputfn = NULL; const char *reconfn = NULL; - const char *bitstreamfn = NULL; + const char *outputfn = NULL; const char *preset = NULL; const char *tune = NULL; const char *profile = NULL; @@ -192,15 +182,31 @@ int c = getopt_long(argc, argv, short_options, long_options, NULL); if (c == -1) break; - if (c == 'p') + else if (c == 'p') preset = optarg; - if (c == 't') + else if (c == 't') tune = optarg; + else if (c == 'D') + outputBitDepth = atoi(optarg); else if (c == '?') showHelp(param); } - if (x265_param_default_preset(param, preset, tune) < 0) + api = x265_api_get(outputBitDepth); + if (!api) + { + x265_log(NULL, X265_LOG_WARNING, "falling back to 
default bit-depth\n"); + api = x265_api_get(0); + } + + param = api->param_alloc(); + if (!param) + { + x265_log(NULL, X265_LOG_ERROR, "param alloc failed\n"); + return true; + } + + if (api->param_default_preset(param, preset, tune) < 0) { x265_log(NULL, X265_LOG_ERROR, "preset or tune unrecognized\n"); return true; @@ -211,9 +217,7 @@ int long_options_index = -1; int c = getopt_long(argc, argv, short_options, long_options, &long_options_index); if (c == -1) - { break; - } switch (c) { @@ -261,7 +265,7 @@ OPT2("frame-skip", "seek") this->seek = (uint32_t)x265_atoi(optarg, bError); OPT("frames") this->framesToBeEncoded = (uint32_t)x265_atoi(optarg, bError); OPT("no-progress") this->bProgress = false; - OPT("output") bitstreamfn = optarg; + OPT("output") outputfn = optarg; OPT("input") inputfn = optarg; OPT("recon") reconfn = optarg; OPT("input-depth") inputBitDepth = (uint32_t)x265_atoi(optarg, bError); @@ -271,17 +275,19 @@ OPT("profile") profile = optarg; /* handled last */ OPT("preset") /* handled above */; OPT("tune") /* handled above */; + OPT("output-depth") /* handled above */; + OPT("recon-y4m-exec") reconPlayCmd = optarg; OPT("qpfile") { this->qpfile = fopen(optarg, "rb"); if (!this->qpfile) { - x265_log(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file \n", optarg); + x265_log(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file\n", optarg); return false; } } else - bError |= !!x265_param_parse(param, long_options[long_options_index].name, optarg); + bError |= !!api->param_parse(param, long_options[long_options_index].name, optarg); if (bError) { @@ -295,8 +301,8 @@ if (optind < argc && !inputfn) inputfn = argv[optind++]; - if (optind < argc && !bitstreamfn) - bitstreamfn = argv[optind++]; + if (optind < argc && !outputfn) + outputfn = argv[optind++]; if (optind < argc) { x265_log(param, X265_LOG_WARNING, "extra unused command arguments given <%s>\n", argv[optind]); @@ -306,15 +312,15 @@ if (argc <= 1 || help) showHelp(param); - if (inputfn == NULL || bitstreamfn == NULL) + if (inputfn == NULL || outputfn == NULL) { x265_log(param, X265_LOG_ERROR, "input or output file not specified, try -V for help\n"); return true; } - if (param->internalBitDepth != x265_max_bit_depth) + if (param->internalBitDepth != api->max_bit_depth) { - x265_log(param, X265_LOG_ERROR, "Only bit depths of %d are supported in this build\n", x265_max_bit_depth); + x265_log(param, X265_LOG_ERROR, "Only bit depths of %d are supported in this build\n", api->max_bit_depth); return true; } @@ -332,7 +338,7 @@ info.frameCount = 0; getParamAspectRatio(param, info.sarWidth, info.sarHeight); - this->input = Input::open(info, this->bForceY4m); + this->input = InputFile::open(info, this->bForceY4m); if (!this->input || this->input->isFail()) { x265_log(param, X265_LOG_ERROR, "unable to open input file <%s>\n", inputfn); @@ -362,7 +368,11 @@ this->framesToBeEncoded = info.frameCount - seek; param->totalFrames = this->framesToBeEncoded; - if (x265_param_apply_profile(param, profile)) + /* Force CFR until we have support for VFR */ + info.timebaseNum = param->fpsDenom; + info.timebaseDenom = param->fpsNum; + + if (api->param_apply_profile(param, profile)) return true; if (param->logLevel >= X265_LOG_INFO) @@ -381,7 +391,7 @@ else sprintf(buf + p, " frames %u - %d of %d", this->seek, this->seek + this->framesToBeEncoded - 1, info.frameCount); - fprintf(stderr, "%s [info]: %s\n", input->getName(), buf); + general_log(param, input->getName(), X265_LOG_INFO, "%s\n", buf); } 
this->input->startReader(); @@ -390,26 +400,28 @@ { if (reconFileBitDepth == 0) reconFileBitDepth = param->internalBitDepth; - this->recon = Output::open(reconfn, param->sourceWidth, param->sourceHeight, reconFileBitDepth, - param->fpsNum, param->fpsDenom, param->internalCsp); + this->recon = ReconFile::open(reconfn, param->sourceWidth, param->sourceHeight, reconFileBitDepth, + param->fpsNum, param->fpsDenom, param->internalCsp); if (this->recon->isFail()) { - x265_log(param, X265_LOG_WARNING, "unable to write reconstruction file\n"); + x265_log(param, X265_LOG_WARNING, "unable to write reconstructed outputs file\n"); this->recon->release(); this->recon = 0; } else - fprintf(stderr, "%s [info]: reconstructed images %dx%d fps %d/%d %s\n", this->recon->getName(), + general_log(param, this->recon->getName(), X265_LOG_INFO, + "reconstructed images %dx%d fps %d/%d %s\n", param->sourceWidth, param->sourceHeight, param->fpsNum, param->fpsDenom, x265_source_csp_names[param->internalCsp]); } - this->bitstreamFile.open(bitstreamfn, std::fstream::binary | std::fstream::out); - if (!this->bitstreamFile) + this->output = OutputFile::open(outputfn, info); + if (this->output->isFail()) { - x265_log(NULL, X265_LOG_ERROR, "failed to open bitstream file <%s> for writing\n", bitstreamfn); + x265_log(param, X265_LOG_ERROR, "failed to open output file <%s> for writing\n", outputfn); return true; } + general_log(param, this->output->getName(), X265_LOG_INFO, "output file: %s\n", outputfn); return false; } @@ -464,28 +476,45 @@ PROFILE_INIT(); THREAD_NAME("API", 0); - x265_param *param = x265_param_alloc(); + GetConsoleTitle(orgConsoleTitle, CONSOLE_TITLE_SIZE); + SetThreadExecutionState(ES_CONTINUOUS | ES_SYSTEM_REQUIRED | ES_AWAYMODE_REQUIRED); + + ReconPlay* reconPlay = NULL; CLIOptions cliopt; - if (cliopt.parse(argc, argv, param)) + if (cliopt.parse(argc, argv)) { cliopt.destroy(); - x265_param_free(param); + if (cliopt.api) + cliopt.api->param_free(cliopt.param); exit(1); } - x265_encoder *encoder = x265_encoder_open(param); + x265_param* param = cliopt.param; + const x265_api* api = cliopt.api; + + /* This allows muxers to modify bitstream format */ + cliopt.output->setParam(param); + + if (cliopt.reconPlayCmd) + reconPlay = new ReconPlay(cliopt.reconPlayCmd, *param); + + /* note: we could try to acquire a different libx265 API here based on + * the profile found during option parsing, but it must be done before + * opening an encoder */ + + x265_encoder *encoder = api->encoder_open(param); if (!encoder) { x265_log(param, X265_LOG_ERROR, "failed to open encoder\n"); cliopt.destroy(); - x265_param_free(param); - x265_cleanup(); + api->param_free(param); + api->cleanup(); exit(2); } /* get the encoder parameters post-initialization */ - x265_encoder_parameters(encoder, param); + api->encoder_parameters(encoder, param); /* Control-C handler */ if (signal(SIGINT, sigint_handler) == SIG_ERR) @@ -494,7 +523,8 @@ x265_picture pic_orig, pic_out; x265_picture *pic_in = &pic_orig; /* Allocate recon picture if analysisMode is enabled */ - x265_picture *pic_recon = (cliopt.recon || !!param->analysisMode) ? &pic_out : NULL; + std::priority_queue<int64_t>* pts_queue = cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL; + x265_picture *pic_recon = (cliopt.recon || !!param->analysisMode || pts_queue || reconPlay) ? 
&pic_out : NULL; uint32_t inFrameCount = 0; uint32_t outFrameCount = 0; x265_nal *p_nal; @@ -505,17 +535,17 @@ if (!param->bRepeatHeaders) { - if (x265_encoder_headers(encoder, &p_nal, &nal) < 0) + if (api->encoder_headers(encoder, &p_nal, &nal) < 0) { x265_log(param, X265_LOG_ERROR, "Failure generating stream headers\n"); ret = 3; goto fail; } else - cliopt.writeNALs(p_nal, nal); + cliopt.totalbytes += cliopt.output->writeHeaders(p_nal, nal); } - x265_picture_init(param, pic_in); + api->picture_init(param, pic_in); if (cliopt.bDither) { @@ -549,46 +579,72 @@ if (pic_in) { - if (pic_in->bitDepth > X265_DEPTH && cliopt.bDither) + if (pic_in->bitDepth > param->internalBitDepth && cliopt.bDither) { - ditherImage(*pic_in, param->sourceWidth, param->sourceHeight, errorBuf, X265_DEPTH); - pic_in->bitDepth = X265_DEPTH; + ditherImage(*pic_in, param->sourceWidth, param->sourceHeight, errorBuf, param->internalBitDepth); + pic_in->bitDepth = param->internalBitDepth; } + /* Overwrite PTS */ + pic_in->pts = pic_in->poc; } - int numEncoded = x265_encoder_encode(encoder, &p_nal, &nal, pic_in, pic_recon); + int numEncoded = api->encoder_encode(encoder, &p_nal, &nal, pic_in, pic_recon); if (numEncoded < 0) { b_ctrl_c = 1; ret = 4; break; } + + if (reconPlay && numEncoded) + reconPlay->writePicture(*pic_recon); + outFrameCount += numEncoded; if (numEncoded && pic_recon && cliopt.recon) cliopt.recon->writePicture(pic_out); if (nal) - cliopt.writeNALs(p_nal, nal); + { + cliopt.totalbytes += cliopt.output->writeFrame(p_nal, nal, pic_out); + if (pts_queue) + { + pts_queue->push(-pic_out.pts); + if (pts_queue->size() > 2) + pts_queue->pop(); + } + } - cliopt.printStatus(outFrameCount, param); + cliopt.printStatus(outFrameCount); } /* Flush the encoder */ while (!b_ctrl_c) { - int numEncoded = x265_encoder_encode(encoder, &p_nal, &nal, NULL, pic_recon); + int numEncoded = api->encoder_encode(encoder, &p_nal, &nal, NULL, pic_recon); if (numEncoded < 0) { ret = 4; break; } + + if (reconPlay && numEncoded) + reconPlay->writePicture(*pic_recon); + outFrameCount += numEncoded; if (numEncoded && pic_recon && cliopt.recon) cliopt.recon->writePicture(pic_out); if (nal) - cliopt.writeNALs(p_nal, nal); + { + cliopt.totalbytes += cliopt.output->writeFrame(p_nal, nal, pic_out); + if (pts_queue) + { + pts_queue->push(-pic_out.pts); + if (pts_queue->size() > 2) + pts_queue->pop(); + } + } - cliopt.printStatus(outFrameCount, param); + cliopt.printStatus(outFrameCount); if (!numEncoded) break; @@ -599,42 +655,62 @@ fprintf(stderr, "%*s\r", 80, " "); fail: - x265_encoder_get_stats(encoder, &stats, sizeof(stats)); + + delete reconPlay; + + api->encoder_get_stats(encoder, &stats, sizeof(stats)); if (param->csvfn && !b_ctrl_c) - x265_encoder_log(encoder, argc, argv); - x265_encoder_close(encoder); - cliopt.bitstreamFile.close(); + api->encoder_log(encoder, argc, argv); + api->encoder_close(encoder); + + int64_t second_largest_pts = 0; + int64_t largest_pts = 0; + if (pts_queue && pts_queue->size() >= 2) + { + second_largest_pts = -pts_queue->top(); + pts_queue->pop(); + largest_pts = -pts_queue->top(); + pts_queue->pop(); + delete pts_queue; + pts_queue = NULL; + } + cliopt.output->closeFile(largest_pts, second_largest_pts); if (b_ctrl_c) - fprintf(stderr, "aborted at input frame %d, output frame %d\n", - cliopt.seek + inFrameCount, stats.encodedPictureCount); + general_log(param, NULL, X265_LOG_INFO, "aborted at input frame %d, output frame %d\n", + cliopt.seek + inFrameCount, stats.encodedPictureCount); if 
(stats.encodedPictureCount) { - printf("\nencoded %d frames in %.2fs (%.2f fps), %.2f kb/s", stats.encodedPictureCount, - stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate); + char buffer[4096]; + int p = sprintf(buffer, "\nencoded %d frames in %.2fs (%.2f fps), %.2f kb/s", stats.encodedPictureCount, + stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate); if (param->bEnablePsnr) - printf(", Global PSNR: %.3f", stats.globalPsnr); + p += sprintf(buffer + p, ", Global PSNR: %.3f", stats.globalPsnr); if (param->bEnableSsim) - printf(", SSIM Mean Y: %.7f (%6.3f dB)", stats.globalSsim, x265_ssim2dB(stats.globalSsim)); + p += sprintf(buffer + p, ", SSIM Mean Y: %.7f (%6.3f dB)", stats.globalSsim, x265_ssim2dB(stats.globalSsim)); - printf("\n"); + sprintf(buffer + p, "\n"); + general_log(param, NULL, X265_LOG_INFO, buffer); } else { - printf("\nencoded 0 frames\n"); + general_log(param, NULL, X265_LOG_INFO, "\nencoded 0 frames\n"); } - x265_cleanup(); /* Free library singletons */ + api->cleanup(); /* Free library singletons */ cliopt.destroy(); - x265_param_free(param); + api->param_free(param); X265_FREE(errorBuf); + SetConsoleTitle(orgConsoleTitle); + SetThreadExecutionState(ES_CONTINUOUS); + #if HAVE_VLD assert(VLDReportLeaks() == 0); #endif
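
The rewrite above replaces every direct x265_* call in the CLI with the x265_api function table, so one binary can drive an 8-bit or 10-bit libx265 selected at runtime via --output-depth. A condensed sketch of the open/close flow using only entry points visible in this diff; the helper name and the stream geometry are illustrative, not the CLI's actual code:

#include "x265.h"
#include <cstddef>

static x265_encoder* openForDepth(int outputBitDepth, const x265_api** apiOut)
{
    const x265_api* api = x265_api_get(outputBitDepth);
    if (!api)
        api = x265_api_get(0);      // fall back to the linked libx265

    x265_param* param = api->param_alloc();
    if (!param)
        return NULL;
    if (api->param_default_preset(param, "medium", NULL) < 0)
    {
        api->param_free(param);
        return NULL;
    }
    param->sourceWidth = 1280;      // illustrative stream geometry
    param->sourceHeight = 720;
    param->fpsNum = 60;
    param->fpsDenom = 1;

    x265_encoder* enc = api->encoder_open(param);
    api->param_free(param);         // the encoder keeps its own copy
    *apiOut = api;                  // close later via api->encoder_close(enc)
    return enc;
}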
View file
x265_1.6.tar.gz/source/x265.def.in -> x265_1.7.tar.gz/source/x265.def.in
Changed
@@ -14,6 +14,7 @@
 x265_build_info_str
 x265_encoder_headers
 x265_encoder_parameters
+x265_encoder_reconfig
 x265_encoder_encode
 x265_encoder_get_stats
 x265_encoder_log
View file
x265_1.6.tar.gz/source/x265.h -> x265_1.7.tar.gz/source/x265.h
Changed
@@ -416,7 +416,7 @@ * * Frame encoders are distributed between the available thread pools, and * the encoder will never generate more thread pools than frameNumThreads */ - char* numaPools; + const char* numaPools; /* Enable wavefront parallel processing, greatly increases parallelism for * less than 1% compression efficiency loss. Requires a thread pool, enabled @@ -458,7 +458,7 @@ * order. Otherwise the encoder will emit per-stream statistics into the log * file when x265_encoder_log is called (presumably at the end of the * encode) */ - char* csvfn; + const char* csvfn; /*== Internal Picture Specification ==*/ @@ -522,12 +522,21 @@ * performance. Value must be between 1 and 16, default is 3 */ int maxNumReferences; + /* Allow libx265 to emit HEVC bitstreams which do not meet strict level + * requirements. Defaults to false */ + int bAllowNonConformance; + /*== Bitstream Options ==*/ /* Flag indicating whether VPS, SPS and PPS headers should be output with * each keyframe. Default false */ int bRepeatHeaders; + /* Flag indicating whether the encoder should generate start codes (Annex B + * format) or length (file format) before NAL units. Default true, Annex B. + * Muxers should set this to the correct value */ + int bAnnexB; + /* Flag indicating whether the encoder should emit an Access Unit Delimiter * NAL at the start of every access unit. Default false */ int bEnableAccessUnitDelimiters; @@ -869,7 +878,7 @@ int analysisMode; /* Filename for analysisMode save/load. Default name is "x265_analysis.dat" */ - char* analysisFileName; + const char* analysisFileName; /*== Rate Control ==*/ @@ -962,7 +971,7 @@ /* Filename of the 2pass output/input stats file, if unspecified the * encoder will default to using x265_2pass.log */ - char* statFileName; + const char* statFileName; /* temporally blur quants */ double qblur; @@ -988,6 +997,12 @@ /* Enable stricter conditions to check bitrate deviations in CBR mode. May compromise * quality to maintain bitrate adherence */ int bStrictCbr; + + /* Enable adaptive quantization at CU granularity. This parameter specifies + * the minimum CU size at which QP can be adjusted, i.e. Quantization Group + * (QG) size. Allowed values are 64, 32, 16 provided it falls within the + * inclusuve range [maxCUSize, minCUSize]. Experimental, default: maxCUSize*/ + uint32_t qgSize; } rc; /*== Video Usability Information ==*/ @@ -1084,6 +1099,22 @@ * conformance cropping window to further crop the displayed window */ int defDispWinBottomOffset; } vui; + + /* SMPTE ST 2086 mastering display color volume SEI info, specified as a + * string which is parsed when the stream header SEI are emitted. The string + * format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)" where %hu + * are unsigned 16bit integers and %u are unsigned 32bit integers. The SEI + * includes X,Y display primaries for RGB channels, white point X,Y and + * max,min luminance values. */ + const char* masteringDisplayColorVolume; + + /* Content light level info SEI, specified as a string which is parsed when + * the stream header SEI are emitted. The string format is "%hu,%hu" where + * %hu are unsigned 16bit integers. The first value is the max content light + * level (or 0 if no maximum is indicated), the second value is the maximum + * picture average light level (or 0). 
*/ + const char* contentLightLevelInfo; + } x265_param; /* x265_param_alloc: @@ -1162,12 +1193,10 @@ void x265_picture_init(x265_param *param, x265_picture *pic); /* x265_max_bit_depth: - * Specifies the maximum number of bits per pixel that x265 can input. This - * is also the max bit depth that x265 encodes in. When x265_max_bit_depth - * is 8, the internal and input bit depths must be 8. When - * x265_max_bit_depth is 12, the internal and input bit depths can be - * either 8, 10, or 12. Note that the internal bit depth must be the same - * for all encoders allocated in the same process. */ + * Specifies the numer of bits per pixel that x265 uses internally to + * represent a pixel, and the bit depth of the output bitstream. + * param->internalBitDepth must be set to this value. x265_max_bit_depth + * will be 8 for default builds, 10 for HIGH_BIT_DEPTH builds. */ X265_API extern const int x265_max_bit_depth; /* x265_version_str: @@ -1214,6 +1243,21 @@ * Once flushing has begun, all subsequent calls must pass pic_in as NULL. */ int x265_encoder_encode(x265_encoder *encoder, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out); +/* x265_encoder_reconfig: + * various parameters from x265_param are copied. + * this takes effect immediately, on whichever frame is encoded next; + * returns 0 on success, negative on parameter validation error. + * + * not all parameters can be changed; see the actual function for a + * detailed breakdown. since not all parameters can be changed, moving + * from preset to preset may not always fully copy all relevant parameters, + * but should still work usably in practice. however, more so than for + * other presets, many of the speed shortcuts used in ultrafast cannot be + * switched out of; using reconfig to switch between ultrafast and other + * presets is not recommended without a more fine-grained breakdown of + * parameters to take this into account. */ +int x265_encoder_reconfig(x265_encoder *, x265_param *); + /* x265_encoder_get_stats: * returns encoder statistics */ void x265_encoder_get_stats(x265_encoder *encoder, x265_stats *, uint32_t statsSizeBytes); @@ -1253,6 +1297,7 @@ void (*picture_init)(x265_param*, x265_picture*); x265_encoder* (*encoder_open)(x265_param*); void (*encoder_parameters)(x265_encoder*, x265_param*); + int (*encoder_reconfig)(x265_encoder*, x265_param*); int (*encoder_headers)(x265_encoder*, x265_nal**, uint32_t*); int (*encoder_encode)(x265_encoder*, x265_nal**, uint32_t*, x265_picture*, x265_picture*); void (*encoder_get_stats)(x265_encoder*, x265_stats*, uint32_t); @@ -1275,8 +1320,14 @@ * Retrieve the programming interface for a linked x265 library. * May return NULL if no library is available that supports the * requested bit depth. If bitDepth is 0 the function is guarunteed - * to return a non-NULL x265_api pointer, from the system default - * libx265 */ + * to return a non-NULL x265_api pointer, from the linked libx265. + * + * If the requested bitDepth is not supported by the linked libx265, + * it will attempt to dynamically bind x265_api_get() from a shared + * library with an appropriate name: + * 8bit: libx265_main.so + * 10bit: libx265_main10.so + * Obviously the shared library file extension is platform specific */ const x265_api* x265_api_get(int bitDepth); #ifdef __cplusplus
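
Putting the new public surface together: the two HDR SEI fields are plain strings parsed when the stream header SEI are emitted, rc.qgSize controls AQ granularity, and x265_encoder_reconfig() applies permitted changes on the next encoded frame. A short sketch using only declarations from the header above; helper names and all values are illustrative:

#include "x265.h"

static void setupHdrParams(x265_param* param)
{
    // SMPTE ST 2086: G/B/R primaries, white point, max/min display luminance
    param->masteringDisplayColorVolume =
        "G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)";
    param->contentLightLevelInfo = "1000,400";  // "max-cll,max-fall"
    param->rc.qgSize = 16;                      // quant group size: 64, 32 or 16
}

static int retune(x265_encoder* enc, x265_param* param)
{
    // only a subset of fields may change mid-encode; takes effect on the
    // next encoded frame, returns 0 on success
    return x265_encoder_reconfig(enc, param);
}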
View file
x265_1.6.tar.gz/source/x265cli.h -> x265_1.7.tar.gz/source/x265cli.h
Changed
@@ -30,7 +30,7 @@ namespace x265 { #endif -static const char short_options[] = "o:p:f:F:r:I:i:b:s:t:q:m:hwV?"; +static const char short_options[] = "o:D:P:p:f:F:r:I:i:b:s:t:q:m:hwV?"; static const struct option long_options[] = { { "help", no_argument, NULL, 'h' }, @@ -47,16 +47,19 @@ { "no-pme", no_argument, NULL, 0 }, { "pme", no_argument, NULL, 0 }, { "log-level", required_argument, NULL, 0 }, - { "profile", required_argument, NULL, 0 }, + { "profile", required_argument, NULL, 'P' }, { "level-idc", required_argument, NULL, 0 }, { "high-tier", no_argument, NULL, 0 }, { "no-high-tier", no_argument, NULL, 0 }, + { "allow-non-conformance",no_argument, NULL, 0 }, + { "no-allow-non-conformance",no_argument, NULL, 0 }, { "csv", required_argument, NULL, 0 }, { "no-cu-stats", no_argument, NULL, 0 }, { "cu-stats", no_argument, NULL, 0 }, { "y4m", no_argument, NULL, 0 }, { "no-progress", no_argument, NULL, 0 }, { "output", required_argument, NULL, 'o' }, + { "output-depth", required_argument, NULL, 'D' }, { "input", required_argument, NULL, 0 }, { "input-depth", required_argument, NULL, 0 }, { "input-res", required_argument, NULL, 0 }, @@ -181,6 +184,8 @@ { "colormatrix", required_argument, NULL, 0 }, { "chromaloc", required_argument, NULL, 0 }, { "crop-rect", required_argument, NULL, 0 }, + { "master-display", required_argument, NULL, 0 }, + { "max-cll", required_argument, NULL, 0 }, { "no-dither", no_argument, NULL, 0 }, { "dither", no_argument, NULL, 0 }, { "no-repeat-headers", no_argument, NULL, 0 }, @@ -205,6 +210,8 @@ { "strict-cbr", no_argument, NULL, 0 }, { "temporal-layers", no_argument, NULL, 0 }, { "no-temporal-layers", no_argument, NULL, 0 }, + { "qg-size", required_argument, NULL, 0 }, + { "recon-y4m-exec", required_argument, NULL, 0 }, { 0, 0, 0, 0 }, { 0, 0, 0, 0 }, { 0, 0, 0, 0 }, @@ -236,6 +243,7 @@ H0("-V/--version Show version info and exit\n"); H0("\nOutput Options:\n"); H0("-o/--output <filename> Bitstream output file name\n"); + H0("-D/--output-depth 8|10 Output bit depth (also internal bit depth). Default %d\n", param->internalBitDepth); H0(" --log-level <string> Logging level: none error warning info debug full. Default %s\n", x265::logLevelNames[param->logLevel + 1]); H0(" --no-progress Disable CLI progress reports\n"); H0(" --[no-]cu-stats Enable logging stats about distribution of cu across all modes. Default %s\n",OPT(param->bLogCuStats)); @@ -255,9 +263,10 @@ H0(" --[no-]ssim Enable reporting SSIM metric scores. Default %s\n", OPT(param->bEnableSsim)); H0(" --[no-]psnr Enable reporting PSNR metric scores. Default %s\n", OPT(param->bEnablePsnr)); H0("\nProfile, Level, Tier:\n"); - H0(" --profile <string> Enforce an encode profile: main, main10, mainstillpicture\n"); + H0("-P/--profile <string> Enforce an encode profile: main, main10, mainstillpicture\n"); H0(" --level-idc <integer|float> Force a minimum required decoder level (as '5.0' or '50')\n"); H0(" --[no-]high-tier If a decoder level is specified, this modifier selects High tier of that level\n"); + H0(" --[no-]allow-non-conformance Allow the encoder to generate profile NONE bitstreams. 
Default %s\n", OPT(param->bAllowNonConformance)); H0("\nThreading, performance:\n"); H0(" --pools <integer,...> Comma separated thread count per thread pool (pool per NUMA node)\n"); H0(" '-' implies no threads on node, '+' implies one thread per core on node\n"); @@ -352,12 +361,14 @@ H0(" --analysis-file <filename> Specify file name used for either dumping or reading analysis data.\n"); H0(" --aq-mode <integer> Mode for Adaptive Quantization - 0:none 1:uniform AQ 2:auto variance. Default %d\n", param->rc.aqMode); H0(" --aq-strength <float> Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength); + H0(" --qg-size <int> Specifies the size of the quantization group (64, 32, 16). Default %d\n", param->rc.qgSize); H0(" --[no-]cutree Enable cutree for Adaptive Quantization. Default %s\n", OPT(param->rc.cuTree)); H1(" --ipratio <float> QP factor between I and P. Default %.2f\n", param->rc.ipFactor); H1(" --pbratio <float> QP factor between P and B. Default %.2f\n", param->rc.pbFactor); H1(" --qcomp <float> Weight given to predicted complexity. Default %.2f\n", param->rc.qCompress); - H1(" --cbqpoffs <integer> Chroma Cb QP Offset. Default %d\n", param->cbQpOffset); - H1(" --crqpoffs <integer> Chroma Cr QP Offset. Default %d\n", param->crQpOffset); + H1(" --qpstep <integer> The maximum single adjustment in QP allowed to rate control. Default %d\n", param->rc.qpStep); + H1(" --cbqpoffs <integer> Chroma Cb QP Offset [-12..12]. Default %d\n", param->cbQpOffset); + H1(" --crqpoffs <integer> Chroma Cr QP Offset [-12..12]. Default %d\n", param->crQpOffset); H1(" --scaling-list <string> Specify a file containing HM style quant scaling lists or 'default' or 'off'. Default: off\n"); H1(" --lambda-file <string> Specify a file containing replacement values for the lambda tables\n"); H1(" MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n"); @@ -384,6 +395,9 @@ H1(" --colormatrix <string> Specify color matrix setting from undef, bt709, fcc, bt470bg, smpte170m,\n"); H1(" smpte240m, GBR, YCgCo, bt2020nc, bt2020c. Default undef\n"); H1(" --chromaloc <integer> Specify chroma sample location (0 to 5). Default of %d\n", param->vui.chromaSampleLocTypeTopField); + H0(" --master-display <string> SMPTE ST 2086 master display color volume info SEI (HDR)\n"); + H0(" format: G(x,y)B(x,y)R(x,y)WP(x,y)L(max,min)\n"); + H0(" --max-cll <string> Emit content light level info SEI as \"cll,fall\" (HDR)\n"); H0("\nBitstream options:\n"); H0(" --[no-]repeat-headers Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders)); H0(" --[no-]info Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI)); @@ -394,6 +408,7 @@ H1("\nReconstructed video options (debugging):\n"); H1("-r/--recon <filename> Reconstructed raw image YUV or Y4M output file name\n"); H1(" --recon-depth <integer> Bit-depth of reconstructed raw image file. Defaults to input bit depth, or 8 if Y4M\n"); + H1(" --recon-y4m-exec <string> pipe reconstructed frames to Y4M viewer, ex:\"ffplay -i pipe:0 -autoexit\"\n"); H1("\nExecutable return codes:\n"); H1(" 0 - encode successful\n"); H1(" 1 - unable to parse command line\n");
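
For reference, the new switches compose like this on the command line; the file names are placeholders and the --recon-y4m-exec string is the example from the help text above:

    x265 -P main10 -D 10 --qg-size 16 --master-display "G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)" --max-cll "1000,400" --recon-y4m-exec "ffplay -i pipe:0 -autoexit" input.y4m -o out.hevc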
.