x265
The diffs of some files below are truncated because they were too big.
Changes of Revision 9
x265.changes
Changed
@@ -1,4 +1,62 @@
 -------------------------------------------------------------------
+Tue Apr 28 20:08:06 UTC 2015 - aloisio@gmx.com
+
+- soname bumped to 51
+- Update to stable version 1.6
+  Perfomance changes:
+    * heavy improvements for AVX2 capable platforms
+      (Haswell and later Intel CPUs) and work efficiency
+      improvements for multiple-socket machines.
+
+  API changes:
+    * --threads N replaced by --pools N,N and --lookahead-slices N
+    * --[no-]rdoq-level N - finer control over RDOQ effort
+    * --min-cu-size N - trade-off compression for performance
+    * --max-tu-size N - trade-off compression for performance
+    * --[no-]temporal-layers - code unreferenced B frames in temporal
+      layer 1
+    * --[no-]cip aliases added for --[no-]constrained-intra
+    * Added support for new color transfer functions "smpte-st-2084"
+      and "smpte-st-428
+    * --limit-refs N was added, but not yet implemented
+    * Deprecated x265_setup_primitives() was removed from the public
+      API and is no longer exported DLLs
+
+  Threading changes:
+    * The x265 thread pool has been made NUMA aware.
+    * The --threads parameter, which used to specify a global
+      pool size, has been replaced with a --pools parameter which
+      allows you to specify a pool size per NUMA node (aka CPU socket
+      or package). The default is still to allocate one pool worker
+      thread per logical core on the machine, but with --pools one
+      can isolate those threads to a given socket.
+    * Other than socket isolation, the biggest visible change in the
+      NUMA aware thread pools is the increase in work efficiency.
+      The total utilization will generally decrease but the performance
+      will increase since worker threads spend less time context
+      switching. Also, the threading of the lookahead was made more
+      work-efficient. Each lookahead job is a much larger piece of work.
+      Before (1.5):
+        disable thread pool: --threads 1
+        default thread pool: --threads 0
+        restrict to 4 threads: --threads 4
+      After (1.6):
+        disable thread pools: --pools 0
+        default thread pools: --pools *
+        restrict to 4 threads: --pools 4
+        restrict to 4 threads on socket 1: --pools -,4
+        restrict to all threads on socket 0: --pools +,-
+
+  Multi-lib interface:
+    * In order to support runtime selection of a libx265
+      shared library, we have introduced an x265_api structure
+      and an x265_api_get() function. Applications which use
+      this interface to acquire the libx265 functional interface
+      will be able to use shim libraries to bind a particular build
+      of libx265 at run time. See the API documentation for full
+      details.
+
+-------------------------------------------------------------------
 Sun Feb 22 09:07:11 UTC 2015 - aloisio@gmx.com

 - soname bump
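The Before (1.5) / After (1.6) list in the changelog entry above can be read as a small mapping from an old --threads value to a new --pools string. The following is a minimal sketch covering only the three cases the changelog lists; the function name threads_to_pools is hypothetical and is not part of x265.

```python
def threads_to_pools(threads: int) -> str:
    """Map an x265 1.5 --threads value to a 1.6 --pools string.

    Only the three cases listed in the changelog are covered.
    """
    if threads < 0:
        raise ValueError("thread count must be non-negative")
    if threads == 1:
        # 1.5: "--threads 1" disabled the thread pool
        return "0"
    if threads == 0:
        # 1.5: "--threads 0" was the default (auto-detected) pool size
        return "*"
    # 1.5: "--threads N" restricted the pool to N threads
    return str(threads)
```

Per-socket forms such as --pools -,4 have no 1.5 equivalent, so the sketch never produces them.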
x265.spec
Changed
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 Name: x265
-%define soname 43
+%define soname 51
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version: 1.5
+Version: 1.6
 Release: 0
 License: GPL-2.0+
 Summary: A free h265/HEVC encoder - encoder binary
@@ -45,7 +45,7 @@
 %prep
 %setup -q -n "%{name}_%{version}/build/linux"
 cd ../..
-%patch0 -p1
+%patch0
 cd -
 %define FAKE_BUILDDATE %(LC_ALL=C date -u -r %{_sourcedir}/%{name}.changes '+%%b %%e %%Y')
 sed -i -e "s/0.0/%{soname}.0/g" ../../source/cmake/version.cmake
arm.patch
Changed
@@ -1,7 +1,6 @@ -diff -urN a/source/CMakeLists.txt b/source/CMakeLists.txt ---- a/source/CMakeLists.txt 2015-02-10 14:15:13.000000000 -0700 -+++ b/source/CMakeLists.txt 2015-02-12 06:25:01.334927114 -0700 -@@ -46,10 +46,18 @@ +--- source/CMakeLists.txt.orig 2015-04-28 21:43:18.585528552 +0200 ++++ source/CMakeLists.txt 2015-04-28 21:47:14.995334232 +0200 +@@ -50,10 +50,18 @@ set(X64 1) add_definitions(-DX86_64=1) endif() @@ -23,8 +22,8 @@ else() message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown") message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}") -@@ -133,8 +141,8 @@ - if(X86 AND NOT X64) +@@ -155,8 +163,8 @@ + elseif(X86 AND NOT X64) add_definitions(-march=i686) endif() - if(ARM) @@ -32,11 +31,10 @@ + if(ARMV7) + add_definitions(-fPIC) endif() - check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING) - check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS) -diff -urN a/source/common/cpu.cpp b/source/common/cpu.cpp ---- a/source/common/cpu.cpp 2015-02-10 14:15:13.000000000 -0700 -+++ b/source/common/cpu.cpp 2015-02-12 06:25:01.334927114 -0700 + if(FPROFILE_GENERATE) + if(INTEL_CXX) +--- source/common/cpu.cpp.orig 2015-04-28 21:47:44.634923269 +0200 ++++ source/common/cpu.cpp 2015-04-28 21:49:50.305468867 +0200 @@ -37,7 +37,7 @@ #include <machine/cpu.h> #endif
baselibs.conf
Changed
@@ -1,1 +1,1 @@
-libx265-43
+libx265-51
x265_1.5.tar.gz/.hg_archival.txt -> x265_1.6.tar.gz/.hg_archival.txt
Changed
@@ -1,4 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: 9f0324125f53a12f766f6ed6f98f16e2f42337f4
+node: cbeb7d8a4880e4020c4545dd8e498432c3c6cad3
 branch: stable
-tag: 1.5
+tag: 1.6
x265_1.5.tar.gz/.hgtags -> x265_1.6.tar.gz/.hgtags
Changed
@@ -13,3 +13,4 @@
 d6257335c5370ee54317a0426a12c1f0724b18b9 1.2
 c1e4fc0162c14fdb84f5c3bd404fb28cfe10a17f 1.3
 5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
+9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
x265_1.5.tar.gz/doc/reST/api.rst -> x265_1.6.tar.gz/doc/reST/api.rst
Changed
@@ -72,11 +72,13 @@ process. All of the encoders must use the same maximum CTU size because many global variables are configured based on this size. Encoder allocation will fail if a mis-matched CTU size is attempted. + If no encoders are open, **x265_cleanup()** can be called to reset + the configured CTU size so a new size can be used. An encoder is allocated by calling **x265_encoder_open()**:: /* x265_encoder_open: - * create a new encoder handler, all parameters from x265_param are copied */ + * create a new encoder handler, all parameters from x265_param are copied */ x265_encoder* x265_encoder_open(x265_param *); The returned pointer is then passed to all of the functions pertaining @@ -337,10 +339,44 @@ void x265_encoder_close(x265_encoder *); When the application has completed all encodes, it should call -**x265_cleanup()** to free process global resources like the thread pool; -particularly if a memory-leak detection tool is being used:: +**x265_cleanup()** to free process global, particularly if a memory-leak +detection tool is being used. **x265_cleanup()** also resets the saved +CTU size so it will be possible to create a new encoder with a different +CTU size:: - /*** - * Release library static allocations - */ + /* x265_cleanup: + * release library static allocations, reset configured CTU size */ void x265_cleanup(void); + + +Multi-library Interface +======================= + +If your application might want to make a runtime selection between among +a number of libx265 libraries (perhaps 8bpp and 16bpp), then you will +want to use the multi-library interface. + +Instead of directly using all of the **x265_** methods documented +above, you query an x265_api structure from your libx265 and then use +the function pointers within that structure of the same name, but +without the **x265_** prefix. So **x265_param_default()** becomes +**api->param_default()**. 
The key method is x265_api_get():: + + /* x265_api_get: + * Retrieve the programming interface for a linked x265 library. + * May return NULL if no library is available that supports the + * requested bit depth. If bitDepth is 0, the function is guarunteed + * to return a non-NULL x265_api pointer from the system default + * libx265 */ + const x265_api* x265_api_get(int bitDepth); + +The general idea is to request the API for the bitDepth you would prefer +the encoder to use (8 or 10), and if that returns NULL you request the +API for bitDepth=0, which returns the system default libx265. + +Note that using this multi-library API in your application is only the +first step. Next your application must dynamically link to libx265 and +then you must build and install a multi-lib configuration of libx265, +which includes 8bpp and 16bpp builds of libx265 and a shim library which +forwards x265_api_get() calls to the appropriate library using dynamic +loading and binding.
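The fallback described in the api.rst text above (request the preferred bit depth, then fall back to bitDepth=0 if that returns NULL) can be sketched as follows. This is a Python illustration of the control flow only: api_get is a hypothetical stand-in for whatever binding wraps x265_api_get(), with None playing the role of NULL, and select_api is not an x265 function.

```python
from typing import Callable, Optional, TypeVar

Api = TypeVar("Api")


def select_api(api_get: Callable[[int], Optional[Api]],
               preferred_depth: int) -> Optional[Api]:
    """Ask for the preferred bit depth first, then the system default.

    api_get is an assumed wrapper around x265_api_get(); it returns
    None where the C function would return NULL.
    """
    api = api_get(preferred_depth)
    if api is None:
        # bitDepth 0 requests the system default libx265
        api = api_get(0)
    return api
```

With a real binding, the returned object would expose the function pointers of the x265_api structure, for example param_default in place of x265_param_default().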
x265_1.5.tar.gz/doc/reST/cli.rst -> x265_1.6.tar.gz/doc/reST/cli.rst
Changed
@@ -171,19 +171,54 @@ Over-allocation of frame threads will not improve performance, it will generally just increase memory use. -.. option:: --threads <integer> + **Values:** any value between 8 and 16. Default is 0, auto-detect - Number of threads to allocate for the worker thread pool This pool - is used for WPP and for distributed analysis and motion search: - :option:`--wpp` :option:`--pmode` and :option:`--pme` respectively. +.. option:: --pools <string>, --numa-pools <string> - If :option:`--threads` 1 is specified, then no thread pool is - created. When no thread pool is created, all the thread pool - features are implicitly disabled. If all the pool features are - disabled by the user, then the pool is implicitly disabled. + Comma seperated list of threads per NUMA node. If "none", then no worker + pools are created and only frame parallelism is possible. If NULL or "" + (default) x265 will use all available threads on each NUMA node:: - Default 0, one thread is allocated per detected hardware thread - (logical CPU cores) + '+' is a special value indicating all cores detected on the node + '*' is a special value indicating all cores detected on the node and all remaining nodes + '-' is a special value indicating no cores on the node, same as '0' + + example strings for a 4-node system:: + + "" - default, unspecified, all numa nodes are used for thread pools + "*" - same as default + "none" - no thread pools are created, only frame parallelism possible + "-" - same as "none" + "10" - allocate one pool, using up to 10 cores on node 0 + "-,+" - allocate one pool, using all cores on node 1 + "+,-,+" - allocate two pools, using all cores on nodes 0 and 2 + "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2 + "-,*" - allocate three pools, using all cores on nodes 1, 2 and 3 + "8,8,8,8" - allocate four pools with up to 8 threads in each pool + + The total number of threads will be determined by the number of threads + assigned to all nodes. 
The worker threads will each be given affinity for + their node, they will not be allowed to migrate between nodes, but they + will be allowed to move between CPU cores within their node. + + If the three pool features: :option:`--wpp` :option:`--pmode` and + :option:`--pme` are all disabled, then :option:`--pools` is ignored + and no thread pools are created. + + If "none" is specified, then all three of the thread pool features are + implicitly disabled. + + Multiple thread pools will be allocated for any NUMA node with more than + 64 logical CPU cores. But any given thread pool will always use at most + one NUMA node. + + Frame encoders are distributed between the available thread pools, + and the encoder will never generate more thread pools than + :option:`--frame-threads`. The pools are used for WPP and for + distributed analysis and motion search. + + Default "", one thread is allocated per detected hardware thread + (logical CPU cores) and one thread pool per NUMA node. .. option:: --wpp, --no-wpp @@ -409,7 +444,30 @@ If :option:`--level-idc` has been specified, the option adds the intention to support the High tier of that level. If your specified level does not support a High tier, a warning is issued and this - modifier flag is ignored. + modifier flag is ignored. If :option:`--level-idc` has been specified, + but not --high-tier, then the encoder will attempt to encode at the + specified level, main tier first, turning on high tier only if + necessary and available at that level. + +.. option:: --ref <1..16> + + Max number of L0 references to be allowed. This number has a linear + multiplier effect on the amount of work performed in motion search, + but will generally have a beneficial affect on compression and + distortion. + + Note that x265 allows up to 16 L0 references but the HEVC + specification only allows a maximum of 8 total reference frames. 
So + if you have B frames enabled only 7 L0 refs are valid and if you + have :option:`--b-pyramid` enabled (which is enabled by default in + all presets), then only 6 L0 refs are the maximum allowed by the + HEVC specification. If x265 detects that the total reference count + is greater than 8, it will issue a warning that the resulting stream + is non-compliant and it signals the stream as profile NONE and level + NONE but still allows the encode to continue. Compliant HEVC + decoders may refuse to decode such streams. + + Default 3 .. note:: :option:`--profile`, :option:`--level-idc`, and @@ -444,7 +502,7 @@ +-------+---------------------------------------------------------------+ | 3 | RDO mode and split decisions, chroma residual used for sa8d | +-------+---------------------------------------------------------------+ - | 4 | Adds RDO Quant | + | 4 | Currently same as 3 | +-------+---------------------------------------------------------------+ | 5 | Adds RDO prediction decisions | +-------+---------------------------------------------------------------+ @@ -465,6 +523,23 @@ and less frame parallelism as well. Because of this the faster presets use a CU size of 32. Default: 64 +.. option:: --min-cu-size <64|32|16|8> + + Minimum CU size (width and height). By using 16 or 32 the encoder + will not analyze the cost of CUs below that minimum threshold, + saving considerable amounts of compute with a predictable increase + in bitrate. This setting has a large effect on performance on the + faster presets. + + Default: 8 (minimum 8x8 CU for HEVC, best compression efficiency) + +.. note:: + + All encoders within a single process must use the same settings for + the CU size range. :option:`--ctu` and :option:`--min-cu-size` must + be consistent for all of them since the encoder configures several + key global data structures based on this range. + .. 
option:: --rect, --no-rect Enable analysis of rectangular motion partitions Nx2N and 2NxN @@ -494,14 +569,6 @@ Measure full CU size (2Nx2N) merge candidates first; if no residual is found the analysis is short circuited. Default disabled -.. option:: --fast-cbf, --no-fast-cbf - - Short circuit analysis if a prediction is found that does not set - the coded block flag (aka: no residual was encoded). It prevents - the encoder from perhaps finding other predictions that also have no - residual but require less signaling bits or have less distortion. - Only applicable for RD levels 5 and 6. Default disabled - .. option:: --fast-intra, --no-fast-intra Perform an initial scan of every fifth intra angular mode, then @@ -526,14 +593,6 @@ Only effective at RD levels 3 and above, which perform RDO mode decisions. -.. option:: --tskip, --no-tskip - - Enable evaluation of transform skip (bypass DCT but still use - quantization) coding for 4x4 TU coded blocks. - - Only effective at RD levels 3 and above, which perform RDO mode - decisions. Default disabled - .. option:: --tskip-fast, --no-tskip-fast Only evaluate transform skip for NxN intra predictions (4x4 blocks). @@ -567,6 +626,30 @@ Options which affect the transform unit quad-tree, sometimes referred to as the residual quad-tree (RQT). +.. option:: --rdoq-level <0|1|2>, --no-rdoq-level + + Specify the amount of rate-distortion analysis to use within + quantization:: + + At level 0 rate-distortion cost is not considered in quant + + At level 1 rate-distortion cost is used to find optimal rounding + values for each level (and allows psy-rdoq to be effective). It + trades-off the signaling cost of the coefficient vs its post-inverse + quant distortion from the pre-quant coefficient. 
When + :option:`--psy-rdoq` is enabled, this formula is biased in favor of + more energy in the residual (larger coefficient absolute levels) + + At level 2 rate-distortion cost is used to make decimate decisions + on each 4x4 coding group, including the cost of signaling the group + within the group bitmap. If the total distortion of not signaling + the entire coding group is less than the rate cost, the block is + decimated. Next, it applies rate-distortion cost analysis to the + last non-zero coefficient, which can result in many (or all) of the + coding groups being decimated. Psy-rdoq is less effective at + preserving energy when RDOQ is at level 2, since it only has + influence over the level distortion costs. + .. option:: --tu-intra-depth <1..4> The transform unit (residual) quad-tree begins with the same depth @@ -593,9 +676,76 @@ partitions, in which case a TU split is implied and thus the residual quad-tree begins one layer below the CU quad-tree. +.. option:: --nr-intra <integer>, --nr-inter <integer> + + Noise reduction - an adaptive deadzone applied after DCT + (subtracting from DCT coefficients), before quantization. It does + no pixel-level filtering, doesn't cross DCT block boundaries, has no
x265_1.5.tar.gz/doc/reST/presets.rst -> x265_1.6.tar.gz/doc/reST/presets.rst
Changed
@@ -24,19 +24,21 @@
 +==============+===========+===========+==========+========+======+========+======+========+==========+=========+
 | ctu          | 32        | 32        | 32       | 64     | 64   | 64     | 64   | 64     | 64       | 64      |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| bframes      | 4         | 4         | 4        | 4      | 4    | 4      | 4    | 8      | 8        | 8       |
+| min-cu-size  | 16        | 8         | 8        | 8      | 8    | 8      | 8    | 8      | 8        | 8       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| b-adapt      | 0         | 0         | 0        | 0      | 2    | 2      | 2    | 2      | 2        | 2       |
+| bframes      | 3         | 3         | 4        | 4      | 4    | 4      | 4    | 8      | 8        | 8       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| rc-lookahead | 10        | 10        | 15       | 15     | 15   | 20     | 25   | 30     | 40       | 60      |
+| b-adapt      | 0         | 0         | 0        | 0      | 0    | 2      | 2    | 2      | 2        | 2       |
++--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
+| rc-lookahead | 5         | 10        | 15       | 15     | 15   | 20     | 25   | 30     | 40       | 60      |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | scenecut     | 0         | 40        | 40       | 40     | 40   | 40     | 40   | 40     | 40       | 40      |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| refs         | 1         | 1         | 1        | 1      | 3    | 3      | 3    | 3      | 5        | 5       |
+| refs         | 1         | 1         | 1        | 1      | 2    | 3      | 3    | 3      | 5        | 5       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | me           | dia       | hex       | hex      | hex    | hex  | hex    | star | star   | star     | star    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| merange      | 25        | 44        | 57       | 57     | 57   | 57     | 57   | 57     | 57       | 92      |
+| merange      | 57        | 57        | 57       | 57     | 57   | 57     | 57   | 57     | 57       | 92      |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | subme        | 0         | 1         | 1        | 2      | 2    | 2      | 3    | 3      | 4        | 5       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
@@ -60,12 +62,14 @@
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | weightb      | 0         | 0         | 0        | 0      | 0    | 0      | 0    | 1      | 1        | 1       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| aq-mode      | 0         | 0         | 2        | 2      | 2    | 2      | 2    | 2      | 2        | 2       |
+| aq-mode      | 0         | 0         | 1        | 1      | 1    | 1      | 1    | 1      | 1        | 1       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | cuTree       | 0         | 0         | 0        | 0      | 1    | 1      | 1    | 1      | 1        | 1       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | rdLevel      | 2         | 2         | 2        | 2      | 2    | 3      | 4    | 6      | 6        | 6       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
+| rdoq-level   | 0         | 0         | 0        | 0      | 0    | 0      | 2    | 2      | 2        | 2       |
++--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | tu-intra     | 1         | 1         | 1        | 1      | 1    | 1      | 1    | 2      | 3        | 4       |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | tu-inter     | 1         | 1         | 1        | 1      | 1    | 1      | 1    | 2      | 3        | 4       |
@@ -114,17 +118,12 @@
 modes which preserve high frequency noise:

  * :option:`--psy-rd` 0.5
+ * :option:`--rdoq-level` 1
  * :option:`--psy-rdoq` 30

-.. Note::
-
- --psy-rdoq is only effective when RDOQuant is enabled, which is at
- RD levels 4, 5, and 6 (presets slow and below).
-
 It lowers the strength of adaptive quantization, so residual energy can
 be more evenly distributed across the (noisy) picture:

- * :option:`--aq-mode` 1
  * :option:`--aq-strength` 0.3

 And it similarly tunes rate control to prevent the slice QP from
x265_1.5.tar.gz/doc/reST/threading.rst -> x265_1.6.tar.gz/doc/reST/threading.rst
Changed
@@ -2,41 +2,34 @@ Threading ********* -Thread Pool -=========== +Thread Pools +============ -x265 creates a pool of worker threads and shares this thread pool -with all encoders within the same process (it is process global, aka a -singleton). The number of threads within the thread pool is determined -by the encoder which first allocates the pool, which by definition is -the first encoder created within each process. +x265 creates one or more thread pools per encoder, one pool per NUMA +node (typically a CPU socket). :option:`--pools` specifies the number of +pools and the number of threads per pool the encoder will allocate. By +default x265 allocates one thread per (hyperthreaded) CPU core on each +NUMA node. -:option:`--threads` specifies the number of threads the encoder will -try to allocate for its thread pool. If the thread pool was already -allocated this parameter is ignored. By default x265 allocates one -thread per (hyperthreaded) CPU core in your system. +If you are running multiple encoders on a system with multiple NUMA +nodes, it is recommended to isolate each of them to a single node in +order to avoid the NUMA overhead of remote memory access. -Work distribution is job based. Idle worker threads ask their parent -pool object for jobs to perform. When no jobs are available, idle -worker threads block and consume no CPU cycles. +Work distribution is job based. Idle worker threads scan the job +providers assigned to their thread pool for jobs to perform. When no +jobs are available, the idle worker threads block and consume no CPU +cycles. Objects which desire to distribute work to worker threads are known as -job providers (and they derive from the JobProvider class). When job -providers have work they enqueue themselves into the pool's provider -list (and dequeue themselves when they no longer have work). The thread +job providers (and they derive from the JobProvider class). 
The thread pool has a method to **poke** awake a blocked idle thread, and job providers are recommended to call this method when they make new jobs available. Worker jobs are not allowed to block except when abosultely necessary -for data locking. If a job becomes blocked, the worker thread is -expected to drop that job and go back to the pool and find more work. - -.. note:: - - x265_cleanup() frees the process-global thread pool, allowing - it to be reallocated if necessary, but only if no encoders are - allocated at the time it is called. +for data locking. If a job becomes blocked, the work function is +expected to drop that job so the worker thread may go back to the pool +and find more work. Wavefront Parallel Processing ============================= @@ -82,24 +75,35 @@ thread count to be higher than if WPP was enabled. The exact formulas are described in the next section. +Bonded Task Groups +================== + +If a worker thread job has work which can be performed in parallel by +many threads, it may allocate a bonded task group and enlist the help of +other idle worker threads in the same pool. Those threads will cooperate +to complete the work of the bonded task group and then return to their +idle states. The larger and more uniform those tasks are, the better the +bonded task group will perform. + Parallel Mode Analysis -====================== +~~~~~~~~~~~~~~~~~~~~~~ When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to -8x8) will distribute its analysis work to the thread pool. Each analysis -job will measure the cost of one prediction for the CU: merge, skip, -intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At slower presets, the amount -of increased parallelism is often enough to be able to reduce frame -parallelism while achieving the same overall CPU utilization. Reducing -frame threads is often beneficial to ABR and VBV rate control. +8x8) will distribute its analysis work to the thread pool via a bonded +task group. 
Each analysis job will measure the cost of one prediction +for the CU: merge, skip, intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At +slower presets, the amount of increased parallelism is often enough to +be able to reduce frame parallelism while achieving the same overall CPU +utilization. Reducing frame threads is often beneficial to ABR and VBV +rate control. Parallel Motion Estimation -========================== +~~~~~~~~~~~~~~~~~~~~~~~~~~ When :option:`--pme` is enabled all of the analysis functions which perform motion searches to reference frames will distribute those motion -searches as jobs for worker threads (if more than two motion searches -are required). +searches as jobs for worker threads via a bonded task group (if more +than two motion searches are required). Frame Threading =============== @@ -125,16 +129,21 @@ for motion reference must be processed by the loop filters and the loop filters cannot run until a full row has been encoded, and it must run a full row behind the encode process so that the pixels below the row -being filtered are available. When you add up all the row lags each -frame ends up being 3 CTU rows behind its reference frames (the -equivalent of 12 macroblock rows for x264) +being filtered are available. On top of this, HEVC has two loop filters: +deblocking and SAO, which must be run in series with a row lag between +them. When you add up all the row lags each frame ends up being 3 CTU +rows behind its reference frames (the equivalent of 12 macroblock rows +for x264). And keep in mind the wave-front progression pattern; by the +time the reference frame finishes the third row of CTUs, nearly half of +the CTUs in the frame may be compressed (depending on the display aspect +ratio). 
The third extenuating circumstance is that when a frame being encoded becomes blocked by a reference frame row being available, that frame's wave-front becomes completely stalled and when the row becomes available again it can take quite some time for the wave to be restarted, if it -ever does. This makes WPP many times less effective when frame -parallelism is in use. +ever does. This makes WPP less effective when frame parallelism is in +use. :option:`--merange` can have a negative impact on frame parallelism. If the range is too large, more rows of CTU lag must be added to ensure @@ -213,13 +222,13 @@ The lookahead module of x265 (the lowres pre-encode which determines scene cuts and slice types) uses the thread pool to distribute the -lowres cost analysis to worker threads. It follows the same wave-front -pattern as the main encoder except it works in reverse-scan order. +lowres cost analysis to worker threads. It will use bonded task groups +to perform batches of frame cost estimates, and it may optionally use +bonded task groups to measure single frame cost estimates using slices. -The function slicetypeDecide() itself may also be performed by a worker -thread if your system has enough CPU cores to make this a beneficial -trade-off, else it runs within the context of the thread which calls the -x265_encoder_encode(). +The function slicetypeDecide() itself is also be performed by a worker +thread if your encoder has a thread pool, else it runs within the +context of the thread which calls the x265_encoder_encode(). SAO ===
x265_1.6.tar.gz/readme.rst
Added
@@ -0,0 +1,14 @@
+=================
+x265 HEVC Encoder
+=================
+
+| **Read:** | Online `documentation <http://x265.readthedocs.org/en/default/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265/wiki/>`_
+| **Download:** | `releases <http://bitbucket.org/multicoreware/x265/downloads/>`_
+| **Interact:** | #x265 on freenode.irc.net | `x265-devel@videolan.org <http://mailman.videolan.org/listinfo/x265-devel>`_ | `Report an issue <https://bitbucket.org/multicoreware/x265/issues?status=new&status=open>`_
+
+`x265 <https://www.videolan.org/developers/x265.html>`_ is an open
+source HEVC encoder. See the developer wiki for instructions for
+downloading and building the source.
+
+x265 is free to use under the `GNU GPL <http://www.gnu.org/licenses/gpl-2.0.html>`_
+and is also available under a commercial `license <http://x265.org>`_
x265_1.5.tar.gz/source/CMakeLists.txt -> x265_1.6.tar.gz/source/CMakeLists.txt
Changed
@@ -12,6 +12,9 @@ if(POLICY CMP0042) cmake_policy(SET CMP0042 NEW) # MACOSX_RPATH endif() +if(POLICY CMP0054) + cmake_policy(SET CMP0054 OLD) # Only interpret if() arguments as variables or keywords when unquoted +endif() project (x265) cmake_minimum_required (VERSION 2.8.8) # OBJECT libraries require 2.8.8 @@ -20,8 +23,14 @@ include(CheckSymbolExists) include(CheckCXXCompilerFlag) +option(FPROFILE_GENERATE "Compile executable to generate usage data" OFF) +option(FPROFILE_USE "Compile executable using generated usage data" OFF) +option(NATIVE_BUILD "Target the build CPU" OFF) +option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF) +mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD) + # X265_BUILD must be incremented each time the public API is changed -set(X265_BUILD 43) +set(X265_BUILD 51) configure_file("${PROJECT_SOURCE_DIR}/x265.def.in" "${PROJECT_BINARY_DIR}/x265.def") configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in" @@ -29,11 +38,6 @@ SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" "${CMAKE_MODULE_PATH}") -option(CHECKED_BUILD "Enable run-time sanity checks (debugging)" OFF) -if(CHECKED_BUILD) - add_definitions(-DCHECKED_BUILD=1) -endif() - # System architecture detection string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" SYSPROC) set(X86_ALIASES x86 i386 i686 x86_64 amd64) @@ -61,6 +65,19 @@ if(LIBRT) list(APPEND PLATFORM_LIBS rt) endif() + find_package(Numa) + if(NUMA_FOUND) + list(APPEND CMAKE_REQUIRED_LIBRARIES ${NUMA_LIBRARY}) + check_symbol_exists(numa_node_of_cpu numa.h NUMA_V2) + if(NUMA_V2) + add_definitions(-DHAVE_LIBNUMA) + message(STATUS "libnuma found, building with support for NUMA nodes") + list(APPEND PLATFORM_LIBS ${NUMA_LIBRARY}) + link_directories(${NUMA_LIBRARY_DIR}) + include_directories(${NUMA_INCLUDE_DIR}) + endif() + endif() + mark_as_advanced(LIBRT NUMA_FOUND) endif(UNIX) if(X64 AND NOT WIN32) @@ -77,13 +94,13 @@ add_definitions(-DMACOS) endif() -if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang") 
+if(${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang") set(CLANG 1) endif() -if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Intel") +if(${CMAKE_CXX_COMPILER_ID} STREQUAL "Intel") set(INTEL_CXX 1) endif() -if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU") +if(${CMAKE_CXX_COMPILER_ID} STREQUAL "GNU") set(GCC 1) endif() @@ -92,13 +109,12 @@ set(MSVC 1) endif() if(MSVC) - option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF) - if (STATIC_LINK_CRT) + if(STATIC_LINK_CRT) set(CompilerFlags CMAKE_CXX_FLAGS_RELEASE CMAKE_C_FLAGS_RELEASE) foreach(CompilerFlag ${CompilerFlags}) string(REPLACE "/MD" "/MT" ${CompilerFlag} "${${CompilerFlag}}") endforeach() - endif (STATIC_LINK_CRT) + endif(STATIC_LINK_CRT) add_definitions(/W4) # Full warnings add_definitions(/Ob2) # always inline add_definitions(/MP) # multithreaded build @@ -130,12 +146,56 @@ if(ENABLE_PIC) add_definitions(-fPIC) endif(ENABLE_PIC) - if(X86 AND NOT X64) + if(NATIVE_BUILD) + if(INTEL_CXX) + add_definitions(-xhost) + else() + add_definitions(-march=native) + endif() + elseif(X86 AND NOT X64) add_definitions(-march=i686) endif() if(ARM) add_definitions(-march=armv6 -mfloat-abi=hard -mfpu=vfp) endif() + if(FPROFILE_GENERATE) + if(INTEL_CXX) + add_definitions(-prof-gen -prof-dir="${CMAKE_CURRENT_BINARY_DIR}") + list(APPEND LINKER_OPTIONS "-prof-gen") + else() + check_cxx_compiler_flag(-fprofile-generate CC_HAS_PROFILE_GENERATE) + if(CC_HAS_PROFILE_GENERATE) + add_definitions(-fprofile-generate) + list(APPEND LINKER_OPTIONS "-fprofile-generate") + endif(CC_HAS_PROFILE_GENERATE) + endif(INTEL_CXX) + endif(FPROFILE_GENERATE) + if(FPROFILE_USE) + if(INTEL_CXX) + add_definitions(-prof-use -prof-dir="${CMAKE_CURRENT_BINARY_DIR}") + list(APPEND LINKER_OPTIONS "-prof-use") + else() + check_cxx_compiler_flag(-fprofile-use CC_HAS_PROFILE_USE) + check_cxx_compiler_flag(-fprofile-correction CC_HAS_PROFILE_CORRECTION) + check_cxx_compiler_flag(-Wno-error=coverage-mismatch CC_HAS_COVMISMATCH) + if(CC_HAS_PROFILE_USE) + 
add_definitions(-fprofile-use) + list(APPEND LINKER_OPTIONS "-fprofile-use") + endif(CC_HAS_PROFILE_USE) + if(CC_HAS_PROFILE_CORRECTION) + # auto-correct corrupted counters (happens a lot with x265) + add_definitions(-fprofile-correction) + endif(CC_HAS_PROFILE_CORRECTION) + if(CC_HAS_COVMISMATCH) + # ignore coverage mismatches (also happens a lot) + add_definitions(-Wno-error=coverage-mismatch) + endif(CC_HAS_COVMISMATCH) + endif(INTEL_CXX) + endif(FPROFILE_USE) + if(STATIC_LINK_CRT) + add_definitions(-static) + list(APPEND LINKER_OPTIONS "-static") + endif(STATIC_LINK_CRT) check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING) check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS) if (CC_HAS_NO_ARRAY_BOUNDS) @@ -154,6 +214,35 @@ if(CC_HAS_FNO_EXCEPTIONS_FLAG) add_definitions(-fno-exceptions) endif() + set(FSANITIZE "" CACHE STRING "-fsanitize options for GCC/clang") + if(FSANITIZE) + add_definitions(-fsanitize=${FSANITIZE}) + # clang and gcc need the sanitize options to be passed at link + # time so the appropriate ASAN/TSAN runtime libraries can be + # linked. 
+ list(APPEND LINKER_OPTIONS "-fsanitize=${FSANITIZE}") + endif() + option(ENABLE_AGGRESSIVE_CHECKS "Enable stack protection and -ftrapv" OFF) + if(ENABLE_AGGRESSIVE_CHECKS) + # use with care, -ftrapv can cause testbench SIGILL exceptions + # since it is testing corner cases of signed integer math + add_definitions(-DUSING_FTRAPV=1) + check_cxx_compiler_flag(-fsanitize=undefined-trap CC_HAS_CATCH_UNDEFINED) # clang + check_cxx_compiler_flag(-ftrapv CC_HAS_FTRAPV) # gcc + check_cxx_compiler_flag(-fstack-protector-all CC_HAS_STACK_PROTECT) # gcc + if(CC_HAS_FTRAPV) + add_definitions(-ftrapv) + endif() + if(CC_HAS_CATCH_UNDEFINED) + add_definitions(-fsanitize=undefined-trap -fsanitize-undefined-trap-on-error) + endif() + if(CC_HAS_STACK_PROTECT) + add_definitions(-fstack-protector-all) + if(MINGW) + list(APPEND PLATFORM_LIBS ssp) + endif() + endif() + endif(ENABLE_AGGRESSIVE_CHECKS) execute_process(COMMAND ${CMAKE_CXX_COMPILER} -dumpversion OUTPUT_VARIABLE CC_VERSION) endif(GCC) @@ -168,6 +257,11 @@ endif() endif() +option(CHECKED_BUILD "Enable run-time sanity checks (debugging)" OFF) +if(CHECKED_BUILD) + add_definitions(-DCHECKED_BUILD=1) +endif() + # Build options set(LIB_INSTALL_DIR lib CACHE STRING "Install location of libraries") set(BIN_INSTALL_DIR bin CACHE STRING "Install location of executables") @@ -179,6 +273,7 @@ # can disable this if(X64) check if you desparately need a 32bit # build with 10bit/12bit support, but this violates the "shrink wrap
x265_1.6.tar.gz/source/cmake/FindNuma.cmake
Added
@@ -0,0 +1,43 @@
+# Module for locating libnuma
+#
+# Read-only variables:
+# NUMA_FOUND
+# Indicates that the library has been found.
+#
+# NUMA_INCLUDE_DIR
+# Points to the libnuma include directory.
+#
+# NUMA_LIBRARY_DIR
+# Points to the directory that contains the libraries.
+# The content of this variable can be passed to link_directories.
+#
+# NUMA_LIBRARY
+# Points to the libnuma that can be passed to target_link_libararies.
+#
+# Copyright (c) 2015 Steve Borho
+
+include(FindPackageHandleStandardArgs)
+
+find_path(NUMA_ROOT_DIR
+    NAMES include/numa.h
+    PATHS ENV NUMA_ROOT
+    DOC "NUMA root directory")
+
+find_path(NUMA_INCLUDE_DIR
+    NAMES numa.h
+    HINTS ${NUMA_ROOT_DIR}
+    PATH_SUFFIXES include
+    DOC "NUMA include directory")
+
+find_library(NUMA_LIBRARY
+    NAMES numa
+    HINTS ${NUMA_ROOT_DIR}
+    DOC "NUMA library")
+
+if (NUMA_LIBRARY)
+    get_filename_component(NUMA_LIBRARY_DIR ${NUMA_LIBRARY} PATH)
+endif()
+
+mark_as_advanced(NUMA_INCLUDE_DIR NUMA_LIBRARY_DIR NUMA_LIBRARY)
+
+find_package_handle_standard_args(NUMA REQUIRED_VARS NUMA_ROOT_DIR NUMA_INCLUDE_DIR NUMA_LIBRARY)
x265_1.5.tar.gz/source/cmake/version.cmake -> x265_1.6.tar.gz/source/cmake/version.cmake
Changed
@@ -10,9 +10,9 @@ set(X265_LATEST_TAG "0.0") set(X265_TAG_DISTANCE "0") -if(EXISTS ${CMAKE_SOURCE_DIR}/../.hg_archival.txt) +if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../.hg_archival.txt) # read the lines of the archive summary file to extract the version - file(READ ${CMAKE_SOURCE_DIR}/../.hg_archival.txt archive) + file(READ ${CMAKE_CURRENT_SOURCE_DIR}/../.hg_archival.txt archive) STRING(REGEX REPLACE "\n" ";" archive "${archive}") foreach(f ${archive}) string(FIND "${f}" ": " pos) @@ -29,7 +29,7 @@ string(SUBSTRING "${hg_node}" 0 16 hg_id) set(X265_VERSION "${hg_latesttag}+${hg_latesttagdistance}-${hg_id}") endif() -elseif(HG_EXECUTABLE AND EXISTS ${CMAKE_SOURCE_DIR}/../.hg) +elseif(HG_EXECUTABLE AND EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../.hg) if(EXISTS "${HG_EXECUTABLE}.bat") # mercurial source installs on Windows require .bat extension set(HG_EXECUTABLE "${HG_EXECUTABLE}.bat") @@ -38,14 +38,14 @@ execute_process(COMMAND ${HG_EXECUTABLE} log -r. --template "{latesttag}" - WORKING_DIRECTORY ${PROJECT_SOURCE_DIR} + WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR} OUTPUT_VARIABLE X265_LATEST_TAG ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE ) execute_process(COMMAND ${HG_EXECUTABLE} log -r. --template "{latesttagdistance}" - WORKING_DIRECTORY ${PROJECT_SOURCE_DIR} + WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR} OUTPUT_VARIABLE X265_TAG_DISTANCE ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE @@ -53,7 +53,7 @@ execute_process( COMMAND ${HG_EXECUTABLE} log -r. 
--template "{node|short}" - WORKING_DIRECTORY ${PROJECT_SOURCE_DIR} + WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR} OUTPUT_VARIABLE HG_REVISION_ID ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE @@ -67,11 +67,11 @@ else() set(X265_VERSION "${X265_LATEST_TAG}+${X265_TAG_DISTANCE}-${HG_REVISION_ID}") endif() -elseif(GIT_EXECUTABLE AND EXISTS ${CMAKE_SOURCE_DIR}/../.git) +elseif(GIT_EXECUTABLE AND EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../.git) execute_process( COMMAND ${GIT_EXECUTABLE} describe --tags --abbrev=0 - WORKING_DIRECTORY ${PROJECT_SOURCE_DIR} + WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR} OUTPUT_VARIABLE X265_LATEST_TAG ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE @@ -80,7 +80,7 @@ execute_process( COMMAND ${GIT_EXECUTABLE} describe --tags - WORKING_DIRECTORY ${PROJECT_SOURCE_DIR} + WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR} OUTPUT_VARIABLE X265_VERSION ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE
x265_1.5.tar.gz/source/common/CMakeLists.txt -> x265_1.6.tar.gz/source/common/CMakeLists.txt
Changed
@@ -1,7 +1,7 @@ # vim: syntax=cmake if(ENABLE_ASSEMBLY) - set_source_files_properties(primitives.cpp PROPERTIES COMPILE_FLAGS -DENABLE_ASSEMBLY=1) + set_source_files_properties(threading.cpp primitives.cpp PROPERTIES COMPILE_FLAGS -DENABLE_ASSEMBLY=1) set(SSE3 vec/dct-sse3.cpp) set(SSSE3 vec/dct-ssse3.cpp) @@ -48,7 +48,7 @@ if(HIGH_BIT_DEPTH) set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm) else() - set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm ipfilter8.asm loopfilter.asm) + set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm intrapred8_allangs.asm ipfilter8.asm loopfilter.asm) endif() if(NOT X64)
x265_1.5.tar.gz/source/common/bitstream.cpp -> x265_1.6.tar.gz/source/common/bitstream.cpp
Changed
@@ -27,7 +27,7 @@
     uint8_t *temp = X265_MALLOC(uint8_t, m_byteAlloc * 2);
     if (temp)
     {
-        ::memcpy(temp, m_fifo, m_byteOccupancy);
+        memcpy(temp, m_fifo, m_byteOccupancy);
         X265_FREE(m_fifo);
         m_fifo = temp;
         m_byteAlloc *= 2;
@@ -44,7 +44,7 @@
 void Bitstream::write(uint32_t val, uint32_t numBits)
 {
     X265_CHECK(numBits <= 32, "numBits out of range\n");
-    X265_CHECK(numBits == 32 || ((val & (~0 << numBits)) == 0), "numBits & val out of range\n");
+    X265_CHECK(numBits == 32 || ((val & (~0u << numBits)) == 0), "numBits & val out of range\n");
 
     uint32_t totalPartialBits = m_partialByteBits + numBits;
     uint32_t nextPartialBits = totalPartialBits & 7;
@@ -55,7 +55,11 @@
     {
         /* topword aligns m_partialByte with the msb of val */
         uint32_t topword = (numBits - nextPartialBits) & ~7;
+#if USING_FTRAPV
+        uint32_t write_bits = (topword < 32 ? m_partialByte << topword : 0) | (val >> nextPartialBits);
+#else
         uint32_t write_bits = (m_partialByte << topword) | (val >> nextPartialBits);
+#endif
 
         switch (writeBytes)
         {
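The switch from `~0` to `~0u` in the range check matters because left-shifting a negative signed value is not well defined before C++20, whereas the unsigned shift is. A minimal sketch of the same condition as a standalone helper follows; `fitsInBits` is a hypothetical name used only for illustration and is not part of x265, and the sketch assumes a 32-bit `unsigned int`.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the X265_CHECK condition in Bitstream::write():
// returns true when val has no bits set at or above bit position numBits.
// numBits == 32 is tested first because shifting a 32-bit value by 32 is undefined.
inline bool fitsInBits(uint32_t val, uint32_t numBits)
{
    return numBits == 32 || (val & (~0u << numBits)) == 0;
}
```

The `USING_FTRAPV` branch in the third hunk applies the same idea to `m_partialByte << topword` by skipping the shift when `topword` reaches 32.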
x265_1.5.tar.gz/source/common/common.cpp -> x265_1.6.tar.gz/source/common/common.cpp
Changed
@@ -33,6 +33,10 @@
 #include <sys/time.h>
 #endif
 
+#if CHECKED_BUILD || _DEBUG
+int g_checkFailures;
+#endif
+
 int64_t x265_mdate(void)
 {
 #if _WIN32
x265_1.5.tar.gz/source/common/common.h -> x265_1.6.tar.gz/source/common/common.h
Changed
@@ -74,13 +74,6 @@ #define ALIGN_VAR_16(T, var) T var __attribute__((aligned(16))) #define ALIGN_VAR_32(T, var) T var __attribute__((aligned(32))) -#if X265_ARCH_X86 && !defined(X86_64) -extern "C" intptr_t x265_stack_align(void (*func)(), ...); -#define x265_stack_align(func, ...) x265_stack_align((void (*)())func, __VA_ARGS__) -#else -#define x265_stack_align(func, ...) func(__VA_ARGS__) -#endif - #if defined(__MINGW32__) #define fseeko fseeko64 #endif @@ -90,7 +83,6 @@ #define ALIGN_VAR_8(T, var) __declspec(align(8)) T var #define ALIGN_VAR_16(T, var) __declspec(align(16)) T var #define ALIGN_VAR_32(T, var) __declspec(align(32)) T var -#define x265_stack_align(func, ...) func(__VA_ARGS__) #define fseeko _fseeki64 #endif // if defined(__GNUC__) @@ -106,19 +98,20 @@ #if _DEBUG && defined(_MSC_VER) #define DEBUG_BREAK() __debugbreak() #elif __APPLE_CC__ -#define DEBUG_BREAK() __builtin_trap(); +#define DEBUG_BREAK() __builtin_trap() #else -#define DEBUG_BREAK() +#define DEBUG_BREAK() abort() #endif /* If compiled with CHECKED_BUILD perform run-time checks and log any that * fail, both to stderr and to a file */ #if CHECKED_BUILD || _DEBUG +extern int g_checkFailures; #define X265_CHECK(expr, ...) 
if (!(expr)) { \ x265_log(NULL, X265_LOG_ERROR, __VA_ARGS__); \ - DEBUG_BREAK(); \ FILE *fp = fopen("x265_check_failures.txt", "a"); \ if (fp) { fprintf(fp, "%s:%d\n", __FILE__, __LINE__); fprintf(fp, __VA_ARGS__); fclose(fp); } \ + g_checkFailures++; DEBUG_BREAK(); \ } #if _MSC_VER #pragma warning(disable: 4127) // some checks have constant conditions @@ -257,7 +250,7 @@ #define UNIT_SIZE (1 << LOG2_UNIT_SIZE) // unit size of CU partition #define MAX_NUM_PARTITIONS 256 -#define NUM_CU_PARTITIONS (1U << (g_maxFullDepth << 1)) +#define NUM_4x4_PARTITIONS (1U << (g_unitSizeDepth << 1)) // number of 4x4 units in max CU size #define MIN_PU_SIZE 4 #define MIN_TU_SIZE 4 @@ -376,6 +369,7 @@ int32_t* ref; uint8_t* depth; uint8_t* modes; + uint32_t* bestMergeCand; }; /* Stores intra analysis data for a single frame. This struct needs better packing */ @@ -384,6 +378,7 @@ uint8_t* depth; uint8_t* modes; char* partSizes; + uint8_t* chromaModes; }; enum TextType @@ -430,6 +425,8 @@ void x265_free(void *ptr); char* x265_slurp_file(const char *filename); +void x265_setup_primitives(x265_param* param, int cpu); /* primitives.cpp */ + #include "constants.h" #endif // ifndef X265_COMMON_H
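In the X265_CHECK hunk, `DEBUG_BREAK()` moves to the end of the macro and a `g_checkFailures` counter is incremented first, so a failure is logged and counted before the new `abort()` fallback can terminate the process. The sketch below shows only the counting pattern; `DEMO_CHECK`, `g_checkFailuresDemo` and `runChecks` are hypothetical stand-ins, the `do { } while (0)` wrapper is an addition of this sketch, and the file logging and `DEBUG_BREAK()` are deliberately left out.

```cpp
#include <cstdio>

// Hypothetical stand-in for g_checkFailures; not the real x265 symbol.
static int g_checkFailuresDemo = 0;

// Simplified check macro: report the failure, then count it.
// The real macro also appends to x265_check_failures.txt and calls DEBUG_BREAK().
#define DEMO_CHECK(expr, ...) \
    do { \
        if (!(expr)) { \
            std::fprintf(stderr, __VA_ARGS__); \
            g_checkFailuresDemo++; \
        } \
    } while (0)

// Runs one passing and one failing check and returns the failure count so far.
inline int runChecks()
{
    DEMO_CHECK(1 + 1 == 2, "arithmetic broke\n");
    DEMO_CHECK(2 > 3, "expected failure: %d > %d\n", 2, 3);
    return g_checkFailuresDemo;
}
```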
x265_1.5.tar.gz/source/common/constants.cpp -> x265_1.6.tar.gz/source/common/constants.cpp
Changed
@@ -119,9 +119,10 @@
     65535
 };
 
+int g_ctuSizeConfigured = 0;
 uint32_t g_maxLog2CUSize = MAX_LOG2_CU_SIZE;
 uint32_t g_maxCUSize = MAX_CU_SIZE;
-uint32_t g_maxFullDepth = NUM_FULL_DEPTH - 1;
+uint32_t g_unitSizeDepth = NUM_CU_DEPTH;
 uint32_t g_maxCUDepth = NUM_CU_DEPTH - 1;
 uint32_t g_zscanToRaster[MAX_NUM_PARTITIONS] = { 0, };
 uint32_t g_rasterToZscan[MAX_NUM_PARTITIONS] = { 0, };
x265_1.5.tar.gz/source/common/constants.h -> x265_1.6.tar.gz/source/common/constants.h
Changed
@@ -29,6 +29,8 @@
 namespace x265 {
 // private namespace
 
+extern int g_ctuSizeConfigured;
+
 void initZscanToRaster(uint32_t maxFullDepth, uint32_t depth, uint32_t startVal, uint32_t*& curIdx);
 void initRasterToZscan(uint32_t maxFullDepth);
 
@@ -55,7 +57,7 @@
 extern uint32_t g_maxLog2CUSize;
 extern uint32_t g_maxCUSize;
 extern uint32_t g_maxCUDepth;
-extern uint32_t g_maxFullDepth;
+extern uint32_t g_unitSizeDepth; // Depth at which 4x4 unit occurs from max CU size
 
 extern const int16_t g_t4[4][4];
 extern const int16_t g_t8[8][8];
x265_1.5.tar.gz/source/common/cudata.cpp -> x265_1.6.tar.gz/source/common/cudata.cpp
Changed
@@ -38,7 +38,7 @@ void bcast1(uint8_t* dst, uint8_t val) { dst[0] = val; } void copy4(uint8_t* dst, uint8_t* src) { ((uint32_t*)dst)[0] = ((uint32_t*)src)[0]; } -void bcast4(uint8_t* dst, uint8_t val) { ((uint32_t*)dst)[0] = 0x01010101 * val; } +void bcast4(uint8_t* dst, uint8_t val) { ((uint32_t*)dst)[0] = 0x01010101u * val; } void copy16(uint8_t* dst, uint8_t* src) { ((uint64_t*)dst)[0] = ((uint64_t*)src)[0]; ((uint64_t*)dst)[1] = ((uint64_t*)src)[1]; } void bcast16(uint8_t* dst, uint8_t val) { uint64_t bval = 0x0101010101010101ULL * val; ((uint64_t*)dst)[0] = bval; ((uint64_t*)dst)[1] = bval; } @@ -159,11 +159,11 @@ m_chromaFormat = csp; m_hChromaShift = CHROMA_H_SHIFT(csp); m_vChromaShift = CHROMA_V_SHIFT(csp); - m_numPartitions = NUM_CU_PARTITIONS >> (depth * 2); + m_numPartitions = NUM_4x4_PARTITIONS >> (depth * 2); if (!s_partSet[0]) { - s_numPartInCUSize = 1 << g_maxFullDepth; + s_numPartInCUSize = 1 << g_unitSizeDepth; switch (g_maxLog2CUSize) { case 6: @@ -272,7 +272,7 @@ m_cuPelX = (cuAddr % m_slice->m_sps->numCuInWidth) << g_maxLog2CUSize; m_cuPelY = (cuAddr / m_slice->m_sps->numCuInWidth) << g_maxLog2CUSize; m_absIdxInCTU = 0; - m_numPartitions = NUM_CU_PARTITIONS; + m_numPartitions = NUM_4x4_PARTITIONS; /* sequential memsets */ m_partSet((uint8_t*)m_qp, (uint8_t)qp); @@ -300,12 +300,12 @@ // initialize Sub partition void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom) { - m_absIdxInCTU = cuGeom.encodeIdx; + m_absIdxInCTU = cuGeom.absPartIdx; m_encData = ctu.m_encData; m_slice = ctu.m_slice; m_cuAddr = ctu.m_cuAddr; - m_cuPelX = ctu.m_cuPelX + g_zscanToPelX[cuGeom.encodeIdx]; - m_cuPelY = ctu.m_cuPelY + g_zscanToPelY[cuGeom.encodeIdx]; + m_cuPelX = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx]; + m_cuPelY = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx]; m_cuLeft = ctu.m_cuLeft; m_cuAbove = ctu.m_cuAbove; m_cuAboveLeft = ctu.m_cuAboveLeft; @@ -392,7 +392,7 @@ m_cuAbove = cu.m_cuAbove; m_cuAboveLeft = cu.m_cuAboveLeft; m_cuAboveRight = 
cu.m_cuAboveRight; - m_absIdxInCTU = cuGeom.encodeIdx; + m_absIdxInCTU = cuGeom.absPartIdx; m_numPartitions = cuGeom.numPartitions; memcpy(m_qp, cu.m_qp, BytesPerPartition * m_numPartitions); memcpy(m_mv[0], cu.m_mv[0], m_numPartitions * sizeof(MV)); @@ -462,9 +462,9 @@ m_encData = ctu.m_encData; m_slice = ctu.m_slice; m_cuAddr = ctu.m_cuAddr; - m_cuPelX = ctu.m_cuPelX + g_zscanToPelX[cuGeom.encodeIdx]; - m_cuPelY = ctu.m_cuPelY + g_zscanToPelY[cuGeom.encodeIdx]; - m_absIdxInCTU = cuGeom.encodeIdx; + m_cuPelX = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx]; + m_cuPelY = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx]; + m_absIdxInCTU = cuGeom.absPartIdx; m_numPartitions = cuGeom.numPartitions; /* copy out all prediction info for this part */ @@ -559,7 +559,7 @@ return this; } - aPartUnitIdx = g_rasterToZscan[absPartIdx + NUM_CU_PARTITIONS - s_numPartInCUSize]; + aPartUnitIdx = g_rasterToZscan[absPartIdx + NUM_4x4_PARTITIONS - s_numPartInCUSize]; return m_cuAbove; } @@ -581,7 +581,7 @@ return this; } } - alPartUnitIdx = g_rasterToZscan[absPartIdx + NUM_CU_PARTITIONS - s_numPartInCUSize - 1]; + alPartUnitIdx = g_rasterToZscan[absPartIdx + NUM_4x4_PARTITIONS - s_numPartInCUSize - 1]; return m_cuAbove; } @@ -591,7 +591,7 @@ return m_cuLeft; } - alPartUnitIdx = g_rasterToZscan[NUM_CU_PARTITIONS - 1]; + alPartUnitIdx = g_rasterToZscan[NUM_4x4_PARTITIONS - 1]; return m_cuAboveLeft; } @@ -620,14 +620,14 @@ } return NULL; } - arPartUnitIdx = g_rasterToZscan[absPartIdxRT + NUM_CU_PARTITIONS - s_numPartInCUSize + 1]; + arPartUnitIdx = g_rasterToZscan[absPartIdxRT + NUM_4x4_PARTITIONS - s_numPartInCUSize + 1]; return m_cuAbove; } if (!isZeroRow(absPartIdxRT, s_numPartInCUSize)) return NULL; - arPartUnitIdx = g_rasterToZscan[NUM_CU_PARTITIONS - s_numPartInCUSize]; + arPartUnitIdx = g_rasterToZscan[NUM_4x4_PARTITIONS - s_numPartInCUSize]; return m_cuAboveRight; } @@ -720,21 +720,21 @@ } return NULL; } - arPartUnitIdx = g_rasterToZscan[absPartIdxRT + NUM_CU_PARTITIONS - 
s_numPartInCUSize + partUnitOffset]; + arPartUnitIdx = g_rasterToZscan[absPartIdxRT + NUM_4x4_PARTITIONS - s_numPartInCUSize + partUnitOffset]; return m_cuAbove; } if (!isZeroRow(absPartIdxRT, s_numPartInCUSize)) return NULL; - arPartUnitIdx = g_rasterToZscan[NUM_CU_PARTITIONS - s_numPartInCUSize + partUnitOffset - 1]; + arPartUnitIdx = g_rasterToZscan[NUM_4x4_PARTITIONS - s_numPartInCUSize + partUnitOffset - 1]; return m_cuAboveRight; } /* Get left QpMinCu */ const CUData* CUData::getQpMinCuLeft(uint32_t& lPartUnitIdx, uint32_t curAbsIdxInCTU) const { - uint32_t absZorderQpMinCUIdx = curAbsIdxInCTU & (0xFF << (g_maxFullDepth - m_slice->m_pps->maxCuDQPDepth) * 2); + uint32_t absZorderQpMinCUIdx = curAbsIdxInCTU & (0xFF << (g_unitSizeDepth - m_slice->m_pps->maxCuDQPDepth) * 2); uint32_t absRorderQpMinCUIdx = g_zscanToRaster[absZorderQpMinCUIdx]; // check for left CTU boundary @@ -751,7 +751,7 @@ /* Get above QpMinCu */ const CUData* CUData::getQpMinCuAbove(uint32_t& aPartUnitIdx, uint32_t curAbsIdxInCTU) const { - uint32_t absZorderQpMinCUIdx = curAbsIdxInCTU & (0xFF << (g_maxFullDepth - m_slice->m_pps->maxCuDQPDepth) * 2); + uint32_t absZorderQpMinCUIdx = curAbsIdxInCTU & (0xFF << (g_unitSizeDepth - m_slice->m_pps->maxCuDQPDepth) * 2); uint32_t absRorderQpMinCUIdx = g_zscanToRaster[absZorderQpMinCUIdx]; // check for top CTU boundary @@ -790,7 +790,7 @@ int8_t CUData::getLastCodedQP(uint32_t absPartIdx) const { - uint32_t quPartIdxMask = 0xFF << (g_maxFullDepth - m_slice->m_pps->maxCuDQPDepth) * 2; + uint32_t quPartIdxMask = 0xFF << (g_unitSizeDepth - m_slice->m_pps->maxCuDQPDepth) * 2; int lastValidPartIdx = getLastValidPartIdx(absPartIdx & quPartIdxMask); if (lastValidPartIdx >= 0) @@ -800,7 +800,7 @@ if (m_absIdxInCTU) return m_encData->getPicCTU(m_cuAddr)->getLastCodedQP(m_absIdxInCTU); else if (m_cuAddr > 0 && !(m_slice->m_pps->bEntropyCodingSyncEnabled && !(m_cuAddr % m_slice->m_sps->numCuInWidth))) - return m_encData->getPicCTU(m_cuAddr - 
1)->getLastCodedQP(NUM_CU_PARTITIONS); + return m_encData->getPicCTU(m_cuAddr - 1)->getLastCodedQP(NUM_4x4_PARTITIONS); else return (int8_t)m_slice->m_sliceQp; } @@ -932,7 +932,7 @@ bool CUData::setQPSubCUs(int8_t qp, uint32_t absPartIdx, uint32_t depth) { - uint32_t curPartNumb = NUM_CU_PARTITIONS >> (depth << 1); + uint32_t curPartNumb = NUM_4x4_PARTITIONS >> (depth << 1); uint32_t curPartNumQ = curPartNumb >> 2; if (m_cuDepth[absPartIdx] > depth) @@ -1375,8 +1375,8 @@ return true; } -/* Construct list of merging candidates */ -uint32_t CUData::getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField(*mvFieldNeighbours)[2], uint8_t* interDirNeighbours) const +/* Construct list of merging candidates, returns count */ +uint32_t CUData::getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField(*candMvField)[2], uint8_t* candDir) const { uint32_t absPartAddr = m_absIdxInCTU + absPartIdx; const bool isInterB = m_slice->isInterB(); @@ -1385,10 +1385,10 @@ for (uint32_t i = 0; i < maxNumMergeCand; ++i) { - mvFieldNeighbours[i][0].mv = 0; - mvFieldNeighbours[i][1].mv = 0; - mvFieldNeighbours[i][0].refIdx = REF_NOT_VALID; - mvFieldNeighbours[i][1].refIdx = REF_NOT_VALID; + candMvField[i][0].mv = 0; + candMvField[i][1].mv = 0; + candMvField[i][0].refIdx = REF_NOT_VALID; + candMvField[i][1].refIdx = REF_NOT_VALID; }
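The rename from `NUM_CU_PARTITIONS` to `NUM_4x4_PARTITIONS` makes the unit explicit: the value counts 4x4 blocks in the largest CU, and each additional depth level divides it by four. A small sketch of that arithmetic follows. The function names are hypothetical, and deriving the unit-size depth as `log2CtuSize - 2` is an assumption based on the hunk's comment ("Depth at which 4x4 unit occurs from max CU size"), not code taken from x265.

```cpp
#include <cstdint>

// Assumption: the smallest unit is 4x4 (log2 size 2), so the depth at which it
// occurs is the CTU's log2 size minus 2.
constexpr uint32_t unitSizeDepth(uint32_t log2CtuSize)
{
    return log2CtuSize - 2;
}

// Same shape as NUM_4x4_PARTITIONS: 1U << (g_unitSizeDepth << 1).
constexpr uint32_t num4x4Partitions(uint32_t log2CtuSize)
{
    return 1u << (unitSizeDepth(log2CtuSize) << 1);
}

// Same shape as m_numPartitions = NUM_4x4_PARTITIONS >> (depth * 2).
constexpr uint32_t partitionsAtDepth(uint32_t log2CtuSize, uint32_t depth)
{
    return num4x4Partitions(log2CtuSize) >> (depth * 2);
}
```

For a 64x64 CTU this gives 256 units, which matches `MAX_NUM_PARTITIONS` in common.h.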
x265_1.5.tar.gz/source/common/cudata.h -> x265_1.6.tar.gz/source/common/cudata.h
Changed
@@ -64,7 +64,8 @@ MD_ABOVE, // MVP of above block MD_ABOVE_RIGHT, // MVP of above right block MD_BELOW_LEFT, // MVP of below left block - MD_ABOVE_LEFT // MVP of above left block + MD_ABOVE_LEFT, // MVP of above left block + MD_COLLOCATED // MVP of temporal neighbour }; struct CUGeom @@ -82,7 +83,7 @@ uint32_t log2CUSize; // Log of the CU size. uint32_t childOffset; // offset of the first child CU from current CU - uint32_t encodeIdx; // Encoding index of this CU in terms of 4x4 blocks. + uint32_t absPartIdx; // Part index of this CU in terms of 4x4 blocks. uint32_t numPartitions; // Number of 4x4 blocks in the CU uint32_t depth; // depth of this CU relative from CTU uint32_t flags; // CU flags. @@ -94,6 +95,26 @@ int refIdx; }; +// Structure that keeps the neighbour's MV information. +struct InterNeighbourMV +{ + // Neighbour MV. The index represents the list. + MV mv[2]; + + // Collocated right bottom CU addr. + uint32_t cuAddr[2]; + + // For spatial prediction, this field contains the reference index + // in each list (-1 if not available). + // + // For temporal prediction, the first value is used for the + // prediction with list 0. The second value is used for the prediction + // with list 1. For each value, the first four bits are the reference index + // associated to the PMV, and the fifth bit is the list associated to the PMV. 
+ // if both reference indices are -1, then unifiedRef is also -1 + union { int16_t refIdx[2]; int32_t unifiedRef; }; +}; + typedef void(*cucopy_t)(uint8_t* dst, uint8_t* src); // dst and src are aligned to MIN(size, 32) typedef void(*cubcast_t)(uint8_t* dst, uint8_t val); // dst is aligned to MIN(size, 32) @@ -122,9 +143,9 @@ uint32_t m_cuPelY; // CU position within the picture, in pixels (Y) uint32_t m_numPartitions; // maximum number of 4x4 partitions within this CU - int m_chromaFormat; - int m_hChromaShift; - int m_vChromaShift; + uint32_t m_chromaFormat; + uint32_t m_hChromaShift; + uint32_t m_vChromaShift; /* Per-part data, stored contiguously */ int8_t* m_qp; // array of QP values @@ -158,7 +179,7 @@ CUData(); void initialize(const CUDataMemPool& dataPool, uint32_t depth, int csp, int instance); - static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]); + static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, uint32_t minCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]); void initCTU(const Frame& frame, uint32_t cuAddr, int qp); void initSubCU(const CUData& ctu, const CUGeom& cuGeom); @@ -195,9 +216,10 @@ uint8_t getCbf(uint32_t absPartIdx, TextType ttype, uint32_t tuDepth) const { return (m_cbf[ttype][absPartIdx] >> tuDepth) & 0x1; } uint8_t getQtRootCbf(uint32_t absPartIdx) const { return m_cbf[0][absPartIdx] || m_cbf[1][absPartIdx] || m_cbf[2][absPartIdx]; } int8_t getRefQP(uint32_t currAbsIdxInCTU) const; - uint32_t getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField (*mvFieldNeighbours)[2], uint8_t* interDirNeighbours) const; + uint32_t getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField (*candMvField)[2], uint8_t* candDir) const; void clipMv(MV& outMV) const; - int fillMvpCand(uint32_t puIdx, uint32_t absPartIdx, int picList, int refIdx, MV* amvpCand, MV* mvc) const; + int getPMV(InterNeighbourMV *neighbours, uint32_t 
reference_list, uint32_t refIdx, MV* amvpCand, MV* pmv) const; + void getNeighbourMV(uint32_t puIdx, uint32_t absPartIdx, InterNeighbourMV* neighbours) const; void getIntraTUQtDepthRange(uint32_t tuDepthRange[2], uint32_t absPartIdx) const; void getInterTUQtDepthRange(uint32_t tuDepthRange[2], uint32_t absPartIdx) const; @@ -213,10 +235,9 @@ void getAllowedChromaDir(uint32_t absPartIdx, uint32_t* modeList) const; int getIntraDirLumaPredictor(uint32_t absPartIdx, uint32_t* intraDirPred) const; - uint32_t getSCUAddr() const { return (m_cuAddr << g_maxFullDepth * 2) + m_absIdxInCTU; } + uint32_t getSCUAddr() const { return (m_cuAddr << g_unitSizeDepth * 2) + m_absIdxInCTU; } uint32_t getCtxSplitFlag(uint32_t absPartIdx, uint32_t depth) const; uint32_t getCtxSkipFlag(uint32_t absPartIdx) const; - ScanType getCoefScanIdx(uint32_t absPartIdx, uint32_t log2TrSize, bool bIsLuma, bool bIsIntra) const; void getTUEntropyCodingParameters(TUEntropyCodingParameters &result, uint32_t absPartIdx, uint32_t log2TrSize, bool bIsLuma) const; const CUData* getPULeft(uint32_t& lPartUnitIdx, uint32_t curPartUnitIdx) const; @@ -241,15 +262,18 @@ bool hasEqualMotion(uint32_t absPartIdx, const CUData& candCU, uint32_t candAbsPartIdx) const; - bool isDiffMER(int xN, int yN, int xP, int yP) const; + /* Check whether the current PU and a spatial neighboring PU are in same merge region */ + bool isDiffMER(int xN, int yN, int xP, int yP) const { return ((xN >> 2) != (xP >> 2)) || ((yN >> 2) != (yP >> 2)); } // add possible motion vector predictor candidates - bool addMVPCand(MV& mvp, int picList, int refIdx, uint32_t absPartIdx, MVP_DIR dir) const; - bool addMVPCandOrder(MV& mvp, int picList, int refIdx, uint32_t absPartIdx, MVP_DIR dir) const; + bool getDirectPMV(MV& pmv, InterNeighbourMV *neighbours, uint32_t picList, uint32_t refIdx) const; + bool getIndirectPMV(MV& outMV, InterNeighbourMV *neighbours, uint32_t reference_list, uint32_t refIdx) const; + void 
getInterNeighbourMV(InterNeighbourMV *neighbour, uint32_t partUnitIdx, MVP_DIR dir) const; bool getColMVP(MV& outMV, int& outRefIdx, int picList, int cuAddr, int absPartIdx) const; + bool getCollocatedMV(int cuAddr, int partUnitIdx, InterNeighbourMV *neighbour) const; - void scaleMvByPOCDist(MV& outMV, const MV& inMV, int curPOC, int curRefPOC, int colPOC, int colRefPOC) const; + MV scaleMvByPOCDist(const MV& inMV, int curPOC, int curRefPOC, int colPOC, int colRefPOC) const; void deriveLeftRightTopIdx(uint32_t puIdx, uint32_t& partIdxLT, uint32_t& partIdxRT) const; @@ -278,7 +302,7 @@ bool create(uint32_t depth, uint32_t csp, uint32_t numInstances) { - uint32_t numPartition = NUM_CU_PARTITIONS >> (depth * 2); + uint32_t numPartition = NUM_4x4_PARTITIONS >> (depth * 2); uint32_t cuSize = g_maxCUSize >> depth; uint32_t sizeL = cuSize * cuSize; uint32_t sizeC = sizeL >> (CHROMA_H_SHIFT(csp) + CHROMA_V_SHIFT(csp));
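The header now defines `isDiffMER()` inline. Despite the "same merge region" wording in its comment, the expression returns true when the two positions fall in different regions. The sketch below copies that expression into a free function so it can be exercised on its own; the only assumption is that it is taken out of the `CUData` class.

```cpp
// Same expression as the inlined CUData::isDiffMER() in the hunk, as a free function.
// Shifting by 2 groups coordinates into regions that are 4 pixels wide and tall;
// the result is true when (xN, yN) and (xP, yP) land in different regions.
inline bool isDiffMER(int xN, int yN, int xP, int yP)
{
    return ((xN >> 2) != (xP >> 2)) || ((yN >> 2) != (yP >> 2));
}
```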
x265_1.5.tar.gz/source/common/dct.cpp -> x265_1.6.tar.gz/source/common/dct.cpp
Changed
@@ -709,14 +709,12 @@ return numSig; } - -int count_nonzero_c(const int16_t* quantCoeff, int numCoeff) +template<int trSize> +int count_nonzero_c(const int16_t* quantCoeff) { X265_CHECK(((intptr_t)quantCoeff & 15) == 0, "quant buffer not aligned\n"); - X265_CHECK(numCoeff > 0 && (numCoeff & 15) == 0, "numCoeff invalid %d\n", numCoeff); - int count = 0; - + int numCoeff = trSize * trSize; for (int i = 0; i < numCoeff; i++) { count += quantCoeff[i] != 0; @@ -754,6 +752,39 @@ } } +int findPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig) +{ + memset(coeffNum, 0, MLS_GRP_NUM * sizeof(*coeffNum)); + memset(coeffFlag, 0, MLS_GRP_NUM * sizeof(*coeffFlag)); + memset(coeffSign, 0, MLS_GRP_NUM * sizeof(*coeffSign)); + + int scanPosLast = 0; + do + { + const uint32_t cgIdx = (uint32_t)scanPosLast >> MLS_CG_SIZE; + + const uint32_t posLast = scan[scanPosLast++]; + + const int curCoeff = coeff[posLast]; + const uint32_t isNZCoeff = (curCoeff != 0); + // get L1 sig map + // NOTE: the new algorithm is complicated, so I keep reference code here + //uint32_t posy = posLast >> log2TrSize; + //uint32_t posx = posLast - (posy << log2TrSize); + //uint32_t blkIdx0 = ((posy >> MLS_CG_LOG2_SIZE) << codingParameters.log2TrSizeCG) + (posx >> MLS_CG_LOG2_SIZE); + //const uint32_t blkIdx = ((posLast >> (2 * MLS_CG_LOG2_SIZE)) & ~maskPosXY) + ((posLast >> MLS_CG_LOG2_SIZE) & maskPosXY); + //sigCoeffGroupFlag64 |= ((uint64_t)isNZCoeff << blkIdx); + numSig -= isNZCoeff; + + // TODO: optimize by instruction BTS + coeffSign[cgIdx] += (uint16_t)(((uint32_t)curCoeff >> 31) << coeffNum[cgIdx]); + coeffFlag[cgIdx] = (coeffFlag[cgIdx] << 1) + (uint16_t)isNZCoeff; + coeffNum[cgIdx] += (uint8_t)isNZCoeff; + } + while (numSig > 0); + return scanPosLast - 1; +} + } // closing - anonymous file-static namespace namespace x265 { @@ -775,12 +806,17 @@ p.cu[BLOCK_8x8].idct = idct8_c; p.cu[BLOCK_16x16].idct = idct16_c; 
p.cu[BLOCK_32x32].idct = idct32_c; - p.count_nonzero = count_nonzero_c; p.denoiseDct = denoiseDct_c; + p.cu[BLOCK_4x4].count_nonzero = count_nonzero_c<4>; + p.cu[BLOCK_8x8].count_nonzero = count_nonzero_c<8>; + p.cu[BLOCK_16x16].count_nonzero = count_nonzero_c<16>; + p.cu[BLOCK_32x32].count_nonzero = count_nonzero_c<32>; p.cu[BLOCK_4x4].copy_cnt = copy_count<4>; p.cu[BLOCK_8x8].copy_cnt = copy_count<8>; p.cu[BLOCK_16x16].copy_cnt = copy_count<16>; p.cu[BLOCK_32x32].copy_cnt = copy_count<32>; + + p.findPosLast = findPosLast_c; } }
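The `count_nonzero` primitive changes from a single function taking `numCoeff` at run time to one template instance per transform size, registered per block size. A minimal sketch of that pattern follows; `countNonzero`, `countNonzeroTable` and `demoCount4x4` are hypothetical names, and the alignment check of the original is omitted.

```cpp
#include <cstdint>

// Counts nonzero coefficients in a trSize x trSize block; the block size is a
// compile-time constant, as in the templated count_nonzero_c of the hunk.
template<int trSize>
int countNonzero(const int16_t* quantCoeff)
{
    int count = 0;
    for (int i = 0; i < trSize * trSize; i++)
        count += quantCoeff[i] != 0;
    return count;
}

// Hypothetical per-size table, standing in for p.cu[BLOCK_NxN].count_nonzero.
typedef int (*count_nonzero_t)(const int16_t*);
static const count_nonzero_t countNonzeroTable[4] = {
    countNonzero<4>, countNonzero<8>, countNonzero<16>, countNonzero<32>
};

// Usage: a 4x4 block with two nonzero coefficients.
inline int demoCount4x4()
{
    int16_t buf[16] = { 0 };
    buf[3] = 5;
    buf[7] = -1;
    return countNonzeroTable[0](buf);
}
```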
x265_1.5.tar.gz/source/common/deblock.cpp -> x265_1.6.tar.gz/source/common/deblock.cpp
Changed
@@ -70,7 +70,7 @@
 * param Edge the direction of the edge in block boundary (horizonta/vertical), which is added newly */
 void Deblock::deblockCU(const CUData* cu, const CUGeom& cuGeom, const int32_t dir, uint8_t blockStrength[])
 {
-    uint32_t absPartIdx = cuGeom.encodeIdx;
+    uint32_t absPartIdx = cuGeom.absPartIdx;
     uint32_t depth = cuGeom.depth;
     if (cu->m_predMode[absPartIdx] == MODE_NONE)
         return;
@@ -358,7 +358,7 @@
     int16_t m5 = (int16_t)src[offset];
     int16_t m2 = (int16_t)src[-offset * 2];
 
-    int32_t delta = x265_clip3(-tc, tc, ((((m4 - m3) << 2) + m2 - m5 + 4) >> 3));
+    int32_t delta = x265_clip3(-tc, tc, ((((m4 - m3) * 4) + m2 - m5 + 4) >> 3));
     src[-offset] = x265_clip(m3 + (delta & maskP));
     src[0] = x265_clip(m4 - (delta & maskQ));
 }
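The second hunk replaces `(m4 - m3) << 2` with `(m4 - m3) * 4`. The difference can be negative, and a multiplication avoids left-shifting a negative signed value. A sketch of the delta computation in isolation follows; `filterDelta` is a hypothetical name, and `std::min`/`std::max` stand in for `x265_clip3`. Note that the remaining `>> 3` on a negative value is implementation-defined before C++20; the sketch assumes an arithmetic shift, as mainstream compilers provide.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical standalone version of the delta expression from the hunk.
// m2..m5 are the four samples across the edge, tc is the clipping threshold.
inline int32_t filterDelta(int16_t m2, int16_t m3, int16_t m4, int16_t m5, int32_t tc)
{
    // Multiplication instead of "<< 2": (m4 - m3) may be negative.
    const int32_t raw = (((m4 - m3) * 4) + m2 - m5 + 4) >> 3;
    // Clamp to [-tc, tc], standing in for x265_clip3(-tc, tc, raw).
    return std::min(tc, std::max(-tc, raw));
}
```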
x265_1.5.tar.gz/source/common/framedata.h -> x265_1.6.tar.gz/source/common/framedata.h
Changed
@@ -32,6 +32,7 @@
 // private namespace
 
 class PicYuv;
+class JobProvider;
 
 /* Per-frame data that is used during encodes and referenced while the picture
  * is available for reference. A FrameData instance is attached to a Frame as it
@@ -52,6 +53,7 @@
     PicYuv* m_reconPic;
     bool m_bHasReferences; /* used during DPB/RPS updates */
     int m_frameEncoderID; /* the ID of the FrameEncoder encoding this frame */
+    JobProvider* m_jobProvider;
 
     CUDataMemPool m_cuMemPool;
     CUData* m_picCTU;
x265_1.5.tar.gz/source/common/intrapred.cpp -> x265_1.6.tar.gz/source/common/intrapred.cpp
Changed
@@ -27,6 +27,29 @@ using namespace x265; namespace { + +template<int tuSize> +void intraFilter(const pixel* samples, pixel* filtered) /* 1:2:1 filtering of left and top reference samples */ +{ + const int tuSize2 = tuSize << 1; + + pixel topLeft = samples[0], topLast = samples[tuSize2], leftLast = samples[tuSize2 + tuSize2]; + + // filtering top + for (int i = 1; i < tuSize2; i++) + filtered[i] = ((samples[i] << 1) + samples[i - 1] + samples[i + 1] + 2) >> 2; + filtered[tuSize2] = topLast; + + // filtering top-left + filtered[0] = ((topLeft << 1) + samples[1] + samples[tuSize2 + 1] + 2) >> 2; + + // filtering left + filtered[tuSize2 + 1] = ((samples[tuSize2 + 1] << 1) + topLeft + samples[tuSize2 + 2] + 2) >> 2; + for (int i = tuSize2 + 2; i < tuSize2 + tuSize2; i++) + filtered[i] = ((samples[i] << 1) + samples[i - 1] + samples[i + 1] + 2) >> 2; + filtered[tuSize2 + tuSize2] = leftLast; +} + void dcPredFilter(const pixel* above, const pixel* left, pixel* dst, intptr_t dststride, int size) { // boundary pixels processing @@ -216,6 +239,11 @@ void setupIntraPrimitives_c(EncoderPrimitives& p) { + p.cu[BLOCK_4x4].intra_filter = intraFilter<4>; + p.cu[BLOCK_8x8].intra_filter = intraFilter<8>; + p.cu[BLOCK_16x16].intra_filter = intraFilter<16>; + p.cu[BLOCK_32x32].intra_filter = intraFilter<32>; + p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = planar_pred_c<2>; p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = planar_pred_c<3>; p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = planar_pred_c<4>;
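The new `intraFilter` primitive applies a 1:2:1 smoothing filter to the intra reference samples, which are laid out as top-left, then `2 * tuSize` top samples, then `2 * tuSize` left samples; the last top and last left samples are copied unfiltered. The sketch below restates the added function outside x265 so its behaviour can be checked on small inputs. It assumes an 8-bit build (`pixel` as `uint8_t`), and the two demo helpers are hypothetical.

```cpp
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// 1:2:1 filtering of top and left reference samples, following the hunk.
// Buffer layout: [0] top-left, [1 .. 2N] top row, [2N+1 .. 4N] left column.
template<int tuSize>
void intraFilter(const pixel* samples, pixel* filtered)
{
    const int n2 = tuSize << 1;
    const pixel topLeft = samples[0];

    // top row; sample 0 (top-left) is the left neighbour of sample 1
    for (int i = 1; i < n2; i++)
        filtered[i] = static_cast<pixel>(((samples[i] << 1) + samples[i - 1] + samples[i + 1] + 2) >> 2);
    filtered[n2] = samples[n2]; // last top sample is copied unfiltered

    // top-left uses its top and left neighbours
    filtered[0] = static_cast<pixel>(((topLeft << 1) + samples[1] + samples[n2 + 1] + 2) >> 2);

    // left column; the first left sample has top-left as its upper neighbour
    filtered[n2 + 1] = static_cast<pixel>(((samples[n2 + 1] << 1) + topLeft + samples[n2 + 2] + 2) >> 2);
    for (int i = n2 + 2; i < n2 + n2; i++)
        filtered[i] = static_cast<pixel>(((samples[i] << 1) + samples[i - 1] + samples[i + 1] + 2) >> 2);
    filtered[n2 + n2] = samples[n2 + n2]; // last left sample is copied unfiltered
}

// A flat input must come out unchanged.
inline bool flatInputUnchanged()
{
    pixel in[17], out[17];
    for (int i = 0; i < 17; i++) in[i] = 100;
    intraFilter<4>(in, out);
    for (int i = 0; i < 17; i++)
        if (out[i] != 100) return false;
    return true;
}

// A single spike of 8 at top sample 3 spreads into its two neighbours.
inline int spikeResponse(int idx)
{
    pixel in[17] = { 0 }, out[17];
    in[3] = 8;
    intraFilter<4>(in, out);
    return out[idx];
}
```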
x265_1.5.tar.gz/source/common/ipfilter.cpp -> x265_1.6.tar.gz/source/common/ipfilter.cpp
Changed
@@ -34,8 +34,27 @@
 #endif
 namespace {
+template<int dstStride, int width, int height>
+void pixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst)
+{
+    int shift = IF_INTERNAL_PREC - X265_DEPTH;
+    int row, col;
+
+    for (row = 0; row < height; row++)
+    {
+        for (col = 0; col < width; col++)
+        {
+            int16_t val = src[col] << shift;
+            dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
+        }
+
+        src += srcStride;
+        dst += dstStride;
+    }
+}
+
 template<int dstStride>
-void filterConvertPelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
+void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
 {
     int shift = IF_INTERNAL_PREC - X265_DEPTH;
     int row, col;
@@ -65,8 +84,8 @@
         }
 #else
-        ::memset(txt - marginX, txt[0], marginX);
-        ::memset(txt + width, txt[width - 1], marginX);
+        memset(txt - marginX, txt[0], marginX);
+        memset(txt + width, txt[width - 1], marginX);
 #endif
         txt += stride;
@@ -378,7 +397,8 @@
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>; \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
-    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>;
 #define CHROMA_422(W, H) \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -386,7 +406,8 @@
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>; \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
-    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>;
 #define CHROMA_444(W, H) \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -394,7 +415,8 @@
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>; \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
-    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>;
 #define LUMA(W, H) \
     p.pu[LUMA_ ## W ## x ## H].luma_hpp = interp_horiz_pp_c<8, W, H>; \
@@ -403,7 +425,8 @@
     p.pu[LUMA_ ## W ## x ## H].luma_vps = interp_vert_ps_c<8, W, H>; \
     p.pu[LUMA_ ## W ## x ## H].luma_vsp = interp_vert_sp_c<8, W, H>; \
     p.pu[LUMA_ ## W ## x ## H].luma_vss = interp_vert_ss_c<8, W, H>; \
-    p.pu[LUMA_ ## W ## x ## H].luma_hvpp = interp_hv_pp_c<8, W, H>;
+    p.pu[LUMA_ ## W ## x ## H].luma_hvpp = interp_hv_pp_c<8, W, H>; \
+    p.pu[LUMA_ ## W ## x ## H].filter_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>
 void setupFilterPrimitives_c(EncoderPrimitives& p)
 {
@@ -507,11 +530,11 @@
     CHROMA_444(48, 64);
     CHROMA_444(64, 16);
     CHROMA_444(16, 64);
-    p.luma_p2s = filterConvertPelToShort_c<MAX_CU_SIZE>;
+    p.luma_p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-    p.chroma[X265_CSP_I444].p2s = filterConvertPelToShort_c<MAX_CU_SIZE>;
-    p.chroma[X265_CSP_I420].p2s = filterConvertPelToShort_c<MAX_CU_SIZE / 2>;
-    p.chroma[X265_CSP_I422].p2s = filterConvertPelToShort_c<MAX_CU_SIZE / 2>;
+    p.chroma[X265_CSP_I444].p2s = filterPixelToShort_c<MAX_CU_SIZE>;
+    p.chroma[X265_CSP_I420].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
+    p.chroma[X265_CSP_I422].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
     p.extendRowBorder = extendCURowColBorder;
 }
x265_1.5.tar.gz/source/common/lowres.cpp -> x265_1.6.tar.gz/source/common/lowres.cpp
Changed
@@ -56,12 +56,11 @@
     CHECKED_MALLOC(propagateCost, uint16_t, cuCount);
     /* allocate lowres buffers */
-    for (int i = 0; i < 4; i++)
-    {
-        CHECKED_MALLOC(buffer[i], pixel, planesize);
-        /* initialize the whole buffer to prevent valgrind warnings on right edge */
-        memset(buffer[i], 0, sizeof(pixel) * planesize);
-    }
+    CHECKED_MALLOC_ZERO(buffer[0], pixel, 4 * planesize);
+
+    buffer[1] = buffer[0] + planesize;
+    buffer[2] = buffer[1] + planesize;
+    buffer[3] = buffer[2] + planesize;
     lowresPlane[0] = buffer[0] + padoffset;
     lowresPlane[1] = buffer[1] + padoffset;
@@ -96,9 +95,7 @@
 void Lowres::destroy()
 {
-    for (int i = 0; i < 4; i++)
-        X265_FREE(buffer[i]);
-
+    X265_FREE(buffer[0]);
     X265_FREE(intraCost);
     X265_FREE(intraMode);
@@ -126,13 +123,11 @@
 }
 // (re) initialize lowres state
-void Lowres::init(PicYuv *origPic, int poc, int type)
+void Lowres::init(PicYuv *origPic, int poc)
 {
-    bIntraCalculated = false;
     bLastMiniGopBFrame = false;
     bScenecut = true; // could be a scene-cut, until ruled out by flash detection
     bKeyframe = false; // Not a keyframe unless identified by lookahead
-    sliceType = type;
     frameNum = poc;
     leadingBframes = 0;
     indB = 0;
@@ -158,8 +153,8 @@
     /* downscale and generate 4 hpel planes for lookahead */
     primitives.frameInitLowres(origPic->m_picOrg[0],
-        lowresPlane[0], lowresPlane[1], lowresPlane[2], lowresPlane[3],
-        origPic->m_stride, lumaStride, width, lines);
+        lowresPlane[0], lowresPlane[1], lowresPlane[2], lowresPlane[3],
+        origPic->m_stride, lumaStride, width, lines);
     /* extend hpel planes for motion search */
     extendPicBorder(lowresPlane[0], lumaStride, width, lines, origPic->m_lumaMarginX, origPic->m_lumaMarginY);
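The lowres change swaps four separate allocations for one zeroed block that the remaining three plane pointers index into, which is why destroy() now frees only buffer[0]. A rough standalone sketch of that pattern is shown here. PlaneSet and planesContiguousAndZero are hypothetical names, and std::calloc stands in for the CHECKED_MALLOC_ZERO macro, whose real definition is not shown in this diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel; // assumed 8-bit pixel type

// Hypothetical stand-in for the four lowres planes.
struct PlaneSet
{
    pixel* buffer[4];

    PlaneSet() { buffer[0] = buffer[1] = buffer[2] = buffer[3] = NULL; }

    // One zero-initialized allocation; planes 1..3 point into it.
    bool create(std::size_t planesize)
    {
        buffer[0] = static_cast<pixel*>(std::calloc(4 * planesize, sizeof(pixel)));
        if (!buffer[0])
            return false;

        for (int i = 1; i < 4; i++)
            buffer[i] = buffer[i - 1] + planesize;
        return true;
    }

    // Only the base pointer owns memory, so only it is freed.
    void destroy()
    {
        std::free(buffer[0]);
        buffer[0] = buffer[1] = buffer[2] = buffer[3] = NULL;
    }
};

// Returns true when the planes are contiguous and every sample is zero.
bool planesContiguousAndZero(std::size_t planesize)
{
    PlaneSet planes;
    if (!planes.create(planesize))
        return false;

    bool ok = true;
    for (int i = 1; i < 4; i++)
        ok = ok && (planes.buffer[i] == planes.buffer[0] + i * planesize);
    for (std::size_t n = 0; n < 4 * planesize; n++)
        ok = ok && (planes.buffer[0][n] == 0);

    planes.destroy();
    return ok;
}
```

Freeing any of the derived pointers would be an error in this layout, so the single free in the updated destroy() is consistent with the single allocation.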
x265_1.5.tar.gz/source/common/lowres.h -> x265_1.6.tar.gz/source/common/lowres.h
Changed
@@ -114,7 +114,6 @@
     int lines; // height of lowres frame in pixel lines
     int leadingBframes; // number of leading B frames for P or I
-    bool bIntraCalculated;
     bool bScenecut; // Set to false if the frame cannot possibly be part of a real scenecut.
     bool bKeyframe;
     bool bLastMiniGopBFrame;
@@ -151,7 +150,7 @@
     bool create(PicYuv *origPic, int _bframes, bool bAqEnabled);
     void destroy();
-    void init(PicYuv *origPic, int poc, int sliceType);
+    void init(PicYuv *origPic, int poc);
 };
 }
x265_1.5.tar.gz/source/common/mv.h -> x265_1.6.tar.gz/source/common/mv.h
Changed
@@ -56,12 +56,17 @@
     MV& operator >>=(int i) { x >>= i; y >>= i; return *this; }
+#if USING_FTRAPV
+    /* avoid signed left-shifts when -ftrapv is enabled */
+    MV& operator <<=(int i) { x *= (1 << i); y *= (1 << i); return *this; }
+    MV operator <<(int i) const { return MV(x * (1 << i), y * (1 << i)); }
+#else
     MV& operator <<=(int i) { x <<= i; y <<= i; return *this; }
+    MV operator <<(int i) const { return MV(x << i, y << i); }
+#endif
     MV operator >>(int i) const { return MV(x >> i, y >> i); }
-    MV operator <<(int i) const { return MV(x << i, y << i); }
-
     MV operator *(int16_t i) const { return MV(x * i, y * i); }
     MV operator -(const MV& other) const { return MV(x - other.x, y - other.y); }
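The USING_FTRAPV branch replaces a left shift with a multiplication by a power of two. Before C++20, left-shifting a negative signed value was undefined behaviour, while the equivalent multiplication is well defined as long as it does not overflow. A small sketch of the multiplication form follows; MVSketch is a hypothetical cut-down type, not the real x265 MV class, and its 16-bit components are an assumption.

```cpp
#include <cstdint>

// Hypothetical cut-down motion vector, not the real x265 MV.
struct MVSketch
{
    int16_t x, y;

    MVSketch(int x_, int y_) : x((int16_t)x_), y((int16_t)y_) {}

    // Scale by 2^i through multiplication, so a negative component is
    // never the left operand of <<.
    MVSketch operator <<(int i) const { return MVSketch(x * (1 << i), y * (1 << i)); }

    bool operator ==(const MVSketch& o) const { return x == o.x && y == o.y; }
};
```

For in-range values the two forms give the same result, so the sketch only changes which operation the compiler and its overflow checks see.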
x265_1.5.tar.gz/source/common/param.cpp -> x265_1.6.tar.gz/source/common/param.cpp
Changed
@@ -52,9 +52,7 @@
  */
 #undef strtok_r
-char* strtok_r(char * str,
-               const char *delim,
-               char ** nextp)
+char* strtok_r(char* str, const char* delim, char** nextp)
 {
     if (!str)
         str = *nextp;
@@ -87,20 +85,19 @@
 }
 extern "C"
-void x265_param_free(x265_param *p)
+void x265_param_free(x265_param* p)
 {
     return x265_free(p);
 }
 extern "C"
-void x265_param_default(x265_param *param)
+void x265_param_default(x265_param* param)
 {
     memset(param, 0, sizeof(x265_param));
     /* Applying default values to all elements in the param structure */
     param->cpuid = x265::cpu_detect();
     param->bEnableWavefront = 1;
-    param->poolNumThreads = 0;
     param->frameNumThreads = 0;
     param->logLevel = X265_LOG_INFO;
@@ -127,8 +124,10 @@
     /* CU definitions */
     param->maxCUSize = 64;
+    param->minCUSize = 8;
     param->tuQTMaxInterDepth = 1;
     param->tuQTMaxIntraDepth = 1;
+    param->maxTUSize = 32;
     /* Coding Structure */
     param->keyframeMin = 0;
@@ -139,6 +138,7 @@
     param->bFrameAdaptive = X265_B_ADAPT_TRELLIS;
     param->bBPyramid = 1;
     param->scenecutThreshold = 40; /* Magic number pulled in from x264 */
+    param->lookaheadSlices = 0;
     /* Intra Coding Tools */
     param->bEnableConstrainedIntra = 0;
@@ -153,10 +153,10 @@
     param->bEnableWeightedPred = 1;
     param->bEnableWeightedBiPred = 0;
     param->bEnableEarlySkip = 0;
-    param->bEnableCbfFastMode = 0;
     param->bEnableAMP = 0;
     param->bEnableRectInter = 0;
     param->rdLevel = 3;
+    param->rdoqLevel = 0;
     param->bEnableSignHiding = 1;
     param->bEnableTransformSkip = 0;
     param->bEnableTSkipFast = 0;
@@ -175,12 +175,13 @@
     param->crQpOffset = 0;
     param->rdPenalty = 0;
     param->psyRd = 0.3;
-    param->psyRdoq = 1.0;
+    param->psyRdoq = 0.0;
     param->analysisMode = 0;
     param->analysisFileName = NULL;
     param->bIntraInBFrames = 0;
     param->bLossless = 0;
     param->bCULossless = 0;
+    param->bEnableTemporalSubLayers = 0;
     /* Rate control options */
     param->rc.vbvMaxBitrate = 0;
@@ -232,7 +233,7 @@
 }
 extern "C"
-int x265_param_default_preset(x265_param *param, const char *preset, const char *tune)
+int x265_param_default_preset(x265_param* param, const char* preset, const char* tune)
 {
     x265_param_default(param);
@@ -245,10 +246,11 @@
         if (!strcmp(preset, "ultrafast"))
         {
-            param->lookaheadDepth = 10;
+            param->lookaheadDepth = 5;
             param->scenecutThreshold = 0; // disable lookahead
             param->maxCUSize = 32;
-            param->searchRange = 25;
+            param->minCUSize = 16;
+            param->bframes = 3;
             param->bFrameAdaptive = 0;
             param->subpelRefine = 0;
             param->searchMethod = X265_DIA_SEARCH;
@@ -267,7 +269,7 @@
         {
             param->lookaheadDepth = 10;
             param->maxCUSize = 32;
-            param->searchRange = 44;
+            param->bframes = 3;
             param->bFrameAdaptive = 0;
             param->subpelRefine = 1;
             param->bEnableEarlySkip = 1;
@@ -319,6 +321,8 @@
             param->bEnableRectInter = 1;
             param->lookaheadDepth = 25;
             param->rdLevel = 4;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 1.0;
             param->subpelRefine = 3;
             param->maxNumMergeCand = 3;
             param->searchMethod = X265_STAR_SEARCH;
@@ -333,6 +337,8 @@
             param->tuQTMaxInterDepth = 2;
             param->tuQTMaxIntraDepth = 2;
             param->rdLevel = 6;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 1.0;
             param->subpelRefine = 3;
             param->maxNumMergeCand = 3;
             param->searchMethod = X265_STAR_SEARCH;
@@ -348,6 +354,8 @@
             param->tuQTMaxInterDepth = 3;
             param->tuQTMaxIntraDepth = 3;
             param->rdLevel = 6;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 1.0;
             param->subpelRefine = 4;
             param->maxNumMergeCand = 4;
             param->searchMethod = X265_STAR_SEARCH;
@@ -365,6 +373,8 @@
             param->tuQTMaxInterDepth = 4;
             param->tuQTMaxIntraDepth = 4;
             param->rdLevel = 6;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 1.0;
             param->subpelRefine = 5;
             param->maxNumMergeCand = 5;
             param->searchMethod = X265_STAR_SEARCH;
@@ -415,11 +425,11 @@
             param->deblockingFilterBetaOffset = -2;
             param->deblockingFilterTCOffset = -2;
             param->bIntraInBFrames = 0;
+            param->rdoqLevel = 1;
             param->psyRdoq = 30;
             param->psyRd = 0.5;
             param->rc.ipFactor = 1.1;
             param->rc.pbFactor = 1.1;
-            param->rc.aqMode = X265_AQ_VARIANCE;
             param->rc.aqStrength = 0.3;
             param->rc.qCompress = 0.8;
         }
@@ -430,7 +440,7 @@
     return 0;
 }
-static int x265_atobool(const char *str, bool& bError)
+static int x265_atobool(const char* str, bool& bError)
 {
     if (!strcmp(str, "1") ||
         !strcmp(str, "true") ||
@@ -444,7 +454,7 @@
     return 0;
 }
-static double x265_atof(const char *str, bool& bError)
+static double x265_atof(const char* str, bool& bError)
 {
     char *end;
     double v = strtod(str, &end);
@@ -454,7 +464,7 @@
     return v;
 }
-static int parseName(const char *arg, const char * const * names, bool& bError)
+static int parseName(const char* arg, const char* const* names, bool& bError)
 {
     for (int i = 0; names[i]; i++)
         if (!strcmp(arg, names[i]))
@@ -471,7 +481,7 @@
 #define atobool(str) (bNameWasBool = true, x265_atobool(str, bError))
 extern "C"
-int x265_param_parse(x265_param *p, const char *name, const char *value)
+int x265_param_parse(x265_param* p, const char* name, const char* value)
 {
     bool bError = false;
     bool bNameWasBool = false;
@@ -543,7 +553,6 @@
         }
     }
x265_1.5.tar.gz/source/common/picyuv.cpp -> x265_1.6.tar.gz/source/common/picyuv.cpp
Changed
@@ -84,7 +84,7 @@
  * allocated by the same encoder. */
 bool PicYuv::createOffsets(const SPS& sps)
 {
-    uint32_t numPartitions = 1 << (g_maxFullDepth * 2);
+    uint32_t numPartitions = 1 << (g_unitSizeDepth * 2);
     CHECKED_MALLOC(m_cuOffsetY, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
     CHECKED_MALLOC(m_cuOffsetC, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
     for (uint32_t cuRow = 0; cuRow < sps.numCuInHeight; cuRow++)
@@ -176,9 +176,7 @@
         for (int r = 0; r < height; r++)
         {
             for (int c = 0; c < width; c++)
-            {
                 yPixel[c] = (pixel)yChar[c];
-            }
             yPixel += m_stride;
             yChar += pic.stride[0] / sizeof(*yChar);
@@ -229,9 +227,7 @@
     for (int r = 0; r < height; r++)
     {
         for (int x = 0; x < padx; x++)
-        {
             Y[width + x] = Y[width - 1];
-        }
         Y += m_stride;
     }
@@ -257,9 +253,7 @@
     pixel *V = m_picOrg[2] + ((height >> m_vChromaShift) - 1) * m_strideC;
     for (int i = 1; i <= pady; i++)
-    {
         memcpy(Y + i * m_stride, Y, (width + padx) * sizeof(pixel));
-    }
     for (int j = 1; j <= pady >> m_vChromaShift; j++)
     {
x265_1.5.tar.gz/source/common/pixel.cpp -> x265_1.6.tar.gz/source/common/pixel.cpp
Changed
@@ -428,7 +428,7 @@
 void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
 {
     X265_CHECK(((intptr_t)dst & 15) == 0, "dst alignment error\n");
-    X265_CHECK((((intptr_t)src | srcStride) & 15) == 0 || size == 4, "src alignment error\n");
+    X265_CHECK((((intptr_t)src | (srcStride * sizeof(*src))) & 15) == 0 || size == 4, "src alignment error\n");
     X265_CHECK(shift >= 0, "invalid shift\n");
     for (int i = 0; i < size; i++)
@@ -445,7 +445,7 @@
 void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
 {
     X265_CHECK(((intptr_t)dst & 15) == 0, "dst alignment error\n");
-    X265_CHECK((((intptr_t)src | srcStride) & 15) == 0 || size == 4, "src alignment error\n");
+    X265_CHECK((((intptr_t)src | (srcStride * sizeof(*src))) & 15) == 0 || size == 4, "src alignment error\n");
     X265_CHECK(shift > 0, "invalid shift\n");
     int16_t round = 1 << (shift - 1);
@@ -462,7 +462,7 @@
 template<int size>
 void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
 {
-    X265_CHECK((((intptr_t)dst | dstStride) & 15) == 0 || size == 4, "dst alignment error\n");
+    X265_CHECK((((intptr_t)dst | (dstStride * sizeof(*dst))) & 15) == 0 || size == 4, "dst alignment error\n");
     X265_CHECK(((intptr_t)src & 15) == 0, "src alignment error\n");
     X265_CHECK(shift >= 0, "invalid shift\n");
@@ -479,7 +479,7 @@
 template<int size>
 void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
 {
-    X265_CHECK((((intptr_t)dst | dstStride) & 15) == 0 || size == 4, "dst alignment error\n");
+    X265_CHECK((((intptr_t)dst | (dstStride * sizeof(*dst))) & 15) == 0 || size == 4, "dst alignment error\n");
     X265_CHECK(((intptr_t)src & 15) == 0, "src alignment error\n");
     X265_CHECK(shift > 0, "invalid shift\n");
@@ -522,12 +522,10 @@
 #if CHECKED_BUILD || _DEBUG
     const int correction = (IF_INTERNAL_PREC - X265_DEPTH);
-#endif
-
     X265_CHECK(!((w0 << 6) > 32767), "w0 using more than 16 bits, asm output will mismatch\n");
     X265_CHECK(!(round > 32767), "round using more than 16 bits, asm output will mismatch\n");
     X265_CHECK((shift >= correction), "shift must be include factor correction, please update ASM ABI\n");
-    X265_CHECK(!(round & ((1 << correction) - 1)), "round must be include factor correction, please update ASM ABI\n");
+#endif
     for (y = 0; y <= height - 1; y++)
     {
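The alignment checks in the cpy2Dto1D and cpy1Dto2D hunks now multiply the stride by the element size before testing the low four bits. Assuming the stride is an element count, as the multiplication suggests, a stride of 8 int16_t values is 16 bytes and therefore aligned, although the number 8 by itself fails a & 15 test. The sketch below contrasts the two forms; the helper names and the scratch buffer are hypothetical.

```cpp
#include <cstdint>

// 16-byte aligned scratch buffer used only by this sketch.
alignas(16) int16_t g_scratch[64];

const int16_t* scratch() { return g_scratch; }

// Check in the style of the removed lines: stride is tested as an element count.
bool alignedByElements(const int16_t* src, intptr_t stride)
{
    return (((intptr_t)src | stride) & 15) == 0;
}

// Check in the style of the added lines: stride is converted to bytes first.
bool alignedByBytes(const int16_t* src, intptr_t stride)
{
    return (((intptr_t)src | (intptr_t)(stride * sizeof(*src))) & 15) == 0;
}
```

A stride of 4 elements is only 8 bytes and fails the byte-based test too, which matches the `size == 4` exemption kept in each check.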
x265_1.5.tar.gz/source/common/predict.cpp -> x265_1.6.tar.gz/source/common/predict.cpp
Changed
@@ -34,11 +34,23 @@
 #pragma warning(disable: 4127) // conditional expression is constant
 #endif
+PredictionUnit::PredictionUnit(const CUData& cu, const CUGeom& cuGeom, int puIdx)
+{
+    /* address of CTU */
+    ctuAddr = cu.m_cuAddr;
+
+    /* offset of CU */
+    cuAbsPartIdx = cuGeom.absPartIdx;
+
+    /* offset and dimensions of PU */
+    cu.getPartIndexAndSize(puIdx, puAbsPartIdx, width, height);
+}
+
 namespace {
 inline pixel weightBidir(int w0, int16_t P0, int w1, int16_t P1, int round, int shift, int offset)
 {
-    return x265_clip((w0 * (P0 + IF_INTERNAL_OFFS) + w1 * (P1 + IF_INTERNAL_OFFS) + round + (offset << (shift - 1))) >> shift);
+    return x265_clip((w0 * (P0 + IF_INTERNAL_OFFS) + w1 * (P1 + IF_INTERNAL_OFFS) + round + (offset * (1 << (shift - 1)))) >> shift);
 }
 }
@@ -67,82 +79,24 @@
     return false;
 }
-void Predict::predIntraLumaAng(uint32_t dirMode, pixel* dst, intptr_t stride, uint32_t log2TrSize)
-{
-    int sizeIdx = log2TrSize - 2;
-    int tuSize = 1 << log2TrSize;
-    int filter = !!(g_intraFilterFlags[dirMode] & tuSize);
-    X265_CHECK(sizeIdx >= 0 && sizeIdx < 4, "intra block size is out of range\n");
-
-    bool bFilter = log2TrSize <= 4;
-    primitives.cu[sizeIdx].intra_pred[dirMode](dst, stride, intraNeighbourBuf[filter], dirMode, bFilter);
-}
-
-void Predict::predIntraChromaAng(uint32_t dirMode, pixel* dst, intptr_t stride, uint32_t log2TrSizeC, int chFmt)
-{
-    int tuSize = 1 << log2TrSizeC;
-    int tuSize2 = tuSize << 1;
-
-    pixel* srcBuf = intraNeighbourBuf[0];
-
-    if (chFmt == X265_CSP_I444 && (g_intraFilterFlags[dirMode] & tuSize))
-    {
-        pixel* fltBuf = intraNeighbourBuf[1];
-        pixel topLeft = srcBuf[0], topLast = srcBuf[tuSize2], leftLast = srcBuf[tuSize2 + tuSize2];
-
-        // filtering top
-        for (int i = 1; i < tuSize2; i++)
-            fltBuf[i] = ((srcBuf[i] << 1) + srcBuf[i - 1] + srcBuf[i + 1] + 2) >> 2;
-        fltBuf[tuSize2] = topLast;
-
-        // filtering top-left
-        fltBuf[0] = ((srcBuf[0] << 1) + srcBuf[1] + srcBuf[tuSize2 + 1] + 2) >> 2;
-
-        //filtering left
-        fltBuf[tuSize2 + 1] = ((srcBuf[tuSize2 + 1] << 1) + topLeft + srcBuf[tuSize2 + 2] + 2) >> 2;
-        for (int i = tuSize2 + 2; i < tuSize2 + tuSize2; i++)
-            fltBuf[i] = ((srcBuf[i] << 1) + srcBuf[i - 1] + srcBuf[i + 1] + 2) >> 2;
-        fltBuf[tuSize2 + tuSize2] = leftLast;
-
-        srcBuf = intraNeighbourBuf[1];
-    }
-
-    int sizeIdx = log2TrSizeC - 2;
-    X265_CHECK(sizeIdx >= 0 && sizeIdx < 4, "intra block size is out of range\n");
-    primitives.cu[sizeIdx].intra_pred[dirMode](dst, stride, srcBuf, dirMode, 0);
-}
-
-void Predict::initMotionCompensation(const CUData& cu, const CUGeom& cuGeom, int partIdx)
+void Predict::motionCompensation(const CUData& cu, const PredictionUnit& pu, Yuv& predYuv, bool bLuma, bool bChroma)
 {
-    m_predSlice = cu.m_slice;
-    cu.getPartIndexAndSize(partIdx, m_puAbsPartIdx, m_puWidth, m_puHeight);
-    m_ctuAddr = cu.m_cuAddr;
-    m_cuAbsPartIdx = cuGeom.encodeIdx;
-}
-
-void Predict::prepMotionCompensation(const CUData& cu, const CUGeom& cuGeom, int partIdx)
-{
-    initMotionCompensation(cu, cuGeom, partIdx);
-
-    m_refIdx0 = cu.m_refIdx[0][m_puAbsPartIdx];
-    m_clippedMv[0] = cu.m_mv[0][m_puAbsPartIdx];
-    m_refIdx1 = cu.m_refIdx[1][m_puAbsPartIdx];
-    m_clippedMv[1] = cu.m_mv[1][m_puAbsPartIdx];
-    cu.clipMv(m_clippedMv[0]);
-    cu.clipMv(m_clippedMv[1]);
-}
+    int refIdx0 = cu.m_refIdx[0][pu.puAbsPartIdx];
+    int refIdx1 = cu.m_refIdx[1][pu.puAbsPartIdx];
-void Predict::motionCompensation(Yuv& predYuv, bool bLuma, bool bChroma)
-{
-    if (m_predSlice->isInterP())
+    if (cu.m_slice->isInterP())
     {
         /* P Slice */
         WeightValues wv0[3];
-        X265_CHECK(m_refIdx0 >= 0, "invalid P refidx\n");
-        X265_CHECK(m_refIdx0 < m_predSlice->m_numRefIdx[0], "P refidx out of range\n");
-        const WeightParam *wp0 = m_predSlice->m_weightPredTable[0][m_refIdx0];
-        if (m_predSlice->m_pps->bUseWeightPred && wp0->bPresentFlag)
+        X265_CHECK(refIdx0 >= 0, "invalid P refidx\n");
+        X265_CHECK(refIdx0 < cu.m_slice->m_numRefIdx[0], "P refidx out of range\n");
+        const WeightParam *wp0 = cu.m_slice->m_weightPredTable[0][refIdx0];
+        MV mv0 = cu.m_mv[0][pu.puAbsPartIdx];
+        cu.clipMv(mv0);
+
+        if (cu.m_slice->m_pps->bUseWeightPred && wp0->bPresentFlag)
         {
             for (int plane = 0; plane < 3; plane++)
             {
@@ -155,18 +109,18 @@
             ShortYuv& shortYuv = m_predShortYuv[0];
             if (bLuma)
-                predInterLumaShort(shortYuv, *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
+                predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
             if (bChroma)
-                predInterChromaShort(shortYuv, *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
+                predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
-            addWeightUni(predYuv, shortYuv, wv0, bLuma, bChroma);
+            addWeightUni(pu, predYuv, shortYuv, wv0, bLuma, bChroma);
         }
         else
         {
             if (bLuma)
-                predInterLumaPixel(predYuv, *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
+                predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
             if (bChroma)
-                predInterChromaPixel(predYuv, *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
+                predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
         }
     }
     else
@@ -176,10 +130,13 @@
         WeightValues wv0[3], wv1[3];
         const WeightParam *pwp0, *pwp1;
-        if (m_predSlice->m_pps->bUseWeightedBiPred)
+        X265_CHECK(refIdx0 < cu.m_slice->m_numRefIdx[0], "bidir refidx0 out of range\n");
+        X265_CHECK(refIdx1 < cu.m_slice->m_numRefIdx[1], "bidir refidx1 out of range\n");
+
+        if (cu.m_slice->m_pps->bUseWeightedBiPred)
         {
-            pwp0 = m_refIdx0 >= 0 ? m_predSlice->m_weightPredTable[0][m_refIdx0] : NULL;
-            pwp1 = m_refIdx1 >= 0 ? m_predSlice->m_weightPredTable[1][m_refIdx1] : NULL;
+            pwp0 = refIdx0 >= 0 ? cu.m_slice->m_weightPredTable[0][refIdx0] : NULL;
+            pwp1 = refIdx1 >= 0 ? cu.m_slice->m_weightPredTable[1][refIdx1] : NULL;
             if (pwp0 && pwp1 && (pwp0->bPresentFlag || pwp1->bPresentFlag))
             {
@@ -200,7 +157,7 @@
             else
             {
                 /* uniprediction weighting, always outputs to wv0 */
-                const WeightParam* pwp = (m_refIdx0 >= 0) ? pwp0 : pwp1;
+                const WeightParam* pwp = (refIdx0 >= 0) ? pwp0 : pwp1;
                 for (int plane = 0; plane < 3; plane++)
                 {
                     wv0[plane].w = pwp[plane].inputWeight;
@@ -213,89 +170,92 @@
         else
             pwp0 = pwp1 = NULL;
-        if (m_refIdx0 >= 0 && m_refIdx1 >= 0)
+        if (refIdx0 >= 0 && refIdx1 >= 0)
         {
-            /* Biprediction */
-            X265_CHECK(m_refIdx0 < m_predSlice->m_numRefIdx[0], "bidir refidx0 out of range\n");
-            X265_CHECK(m_refIdx1 < m_predSlice->m_numRefIdx[1], "bidir refidx1 out of range\n");
+            MV mv0 = cu.m_mv[0][pu.puAbsPartIdx];
+            MV mv1 = cu.m_mv[1][pu.puAbsPartIdx];
+            cu.clipMv(mv0);
+            cu.clipMv(mv1);
             if (bLuma)
             {
-                predInterLumaShort(m_predShortYuv[0], *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
-                predInterLumaShort(m_predShortYuv[1], *m_predSlice->m_refPicList[1][m_refIdx1]->m_reconPic, m_clippedMv[1]);
+                predInterLumaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterLumaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
             }
             if (bChroma)
             {
-                predInterChromaShort(m_predShortYuv[0], *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
-                predInterChromaShort(m_predShortYuv[1], *m_predSlice->m_refPicList[1][m_refIdx1]->m_reconPic, m_clippedMv[1]);
+                predInterChromaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterChromaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
             }
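The weightBidir change keeps the same arithmetic but scales the offset by multiplication, so a negative offset is never left-shifted. A worked sketch with concrete numbers may help when reading that expression. Everything here is an assumption for illustration: 8-bit output, an internal offset of 1 << 13, and the names weightBidirSketch and clipPixel. The weights of 64 and shift of 13 used in the checks are sample inputs, not values taken from the encoder.

```cpp
#include <cstdint>

static const int kOffsSketch = 1 << 13; // assumed internal offset for 8-bit video

// Clamp to the assumed 8-bit pixel range.
inline int clipPixel(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

// Same arithmetic shape as the updated weightBidir, with the offset scaled by
// multiplication instead of a left shift.
inline uint8_t weightBidirSketch(int w0, int16_t P0, int w1, int16_t P1, int round, int shift, int offset)
{
    int sum = w0 * (P0 + kOffsSketch) + w1 * (P1 + kOffsSketch) + round + offset * (1 << (shift - 1));
    return (uint8_t)clipPixel(sum >> shift);
}
```

With both inputs at 0 (mid-grey after offset removal), equal weights of 64, no rounding term and a shift of 13, the weighted sum is 2^20 and the result is 128; an offset of -2 subtracts 2 * 4096 from the sum and moves the result down by one.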
x265_1.5.tar.gz/source/common/predict.h -> x265_1.6.tar.gz/source/common/predict.h
Changed
@@ -36,6 +36,17 @@
 class Slice;
 struct CUGeom;
+struct PredictionUnit
+{
+    uint32_t ctuAddr; // raster index of current CTU within its picture
+    uint32_t cuAbsPartIdx; // z-order offset of current CU within its CTU
+    uint32_t puAbsPartIdx; // z-order offset of current PU with its CU
+    int width;
+    int height;
+
+    PredictionUnit(const CUData& cu, const CUGeom& cuGeom, int puIdx);
+};
+
 class Predict
 {
 public:
@@ -56,7 +67,7 @@
         int leftUnits;
         int unitWidth;
         int unitHeight;
-        int tuSize;
+        int log2TrSize;
         bool bNeighborFlags[4 * MAX_NUM_SPU_W + 1];
     };
@@ -65,38 +76,34 @@
     // Unfiltered/filtered neighbours of the current partition.
     pixel intraNeighbourBuf[2][258];
+
     /* Slice information */
-    const Slice* m_predSlice;
     int m_csp;
     int m_hChromaShift;
     int m_vChromaShift;
-    /* cached CU information for prediction */
-    uint32_t m_ctuAddr; // raster index of current CTU within its picture
-    uint32_t m_cuAbsPartIdx; // z-order index of current CU within its CTU
-    uint32_t m_puAbsPartIdx; // z-order index of current PU with its CU
-    int m_puWidth;
-    int m_puHeight;
-    int m_refIdx0;
-    int m_refIdx1;
-
-    /* TODO: Need to investigate clipping while writing into the TComDataCU fields itself */
-    MV m_clippedMv[2];
-
     Predict();
     ~Predict();
     bool allocBuffers(int csp);
     // motion compensation functions
-    void predInterLumaPixel(Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const;
-    void predInterChromaPixel(Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const;
+    void predInterLumaPixel(const PredictionUnit& pu, Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const;
+    void predInterChromaPixel(const PredictionUnit& pu, Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const;
-    void predInterLumaShort(ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const;
-    void predInterChromaShort(ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const;
+    void predInterLumaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const;
+    void predInterChromaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const;
-    void addWeightBi(Yuv& predYuv, const ShortYuv& srcYuv0, const ShortYuv& srcYuv1, const WeightValues wp0[3], const WeightValues wp1[3], bool bLuma, bool bChroma) const;
-    void addWeightUni(Yuv& predYuv, const ShortYuv& srcYuv, const WeightValues wp[3], bool bLuma, bool bChroma) const;
+    void addWeightBi(const PredictionUnit& pu, Yuv& predYuv, const ShortYuv& srcYuv0, const ShortYuv& srcYuv1, const WeightValues wp0[3], const WeightValues wp1[3], bool bLuma, bool bChroma) const;
+    void addWeightUni(const PredictionUnit& pu, Yuv& predYuv, const ShortYuv& srcYuv, const WeightValues wp[3], bool bLuma, bool bChroma) const;
+
+    void motionCompensation(const CUData& cu, const PredictionUnit& pu, Yuv& predYuv, bool bLuma, bool bChroma);
+
+    /* Angular Intra */
+    void predIntraLumaAng(uint32_t dirMode, pixel* pred, intptr_t stride, uint32_t log2TrSize);
+    void predIntraChromaAng(uint32_t dirMode, pixel* pred, intptr_t stride, uint32_t log2TrSizeC);
+    void initAdiPattern(const CUData& cu, const CUGeom& cuGeom, uint32_t puAbsPartIdx, const IntraNeighbors& intraNeighbors, int dirMode);
+    void initAdiPatternChroma(const CUData& cu, const CUGeom& cuGeom, uint32_t puAbsPartIdx, const IntraNeighbors& intraNeighbors, uint32_t chromaId);
     /* Intra prediction helper functions */
     static void initIntraNeighbors(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, bool isLuma, IntraNeighbors *IntraNeighbors);
@@ -111,19 +118,6 @@
     static int isAboveRightAvailable(const CUData& cu, uint32_t partIdxRT, bool* bValidFlags, uint32_t numUnits);
     template<bool cip>
     static int isBelowLeftAvailable(const CUData& cu, uint32_t partIdxLB, bool* bValidFlags, uint32_t numUnits);
-
-public:
-
-    /* prepMotionCompensation needs to be called to prepare MC with CU-relevant data */
-    void initMotionCompensation(const CUData& cu, const CUGeom& cuGeom, int partIdx);
-    void prepMotionCompensation(const CUData& cu, const CUGeom& cuGeom, int partIdx);
-    void motionCompensation(Yuv& predYuv, bool bLuma, bool bChroma);
-
-    /* Angular Intra */
-    void predIntraLumaAng(uint32_t dirMode, pixel* pred, intptr_t stride, uint32_t log2TrSize);
-    void predIntraChromaAng(uint32_t dirMode, pixel* pred, intptr_t stride, uint32_t log2TrSizeC, int chFmt);
-    void initAdiPattern(const CUData& cu, const CUGeom& cuGeom, uint32_t absPartIdx, const IntraNeighbors& intraNeighbors, int dirMode);
-    void initAdiPatternChroma(const CUData& cu, const CUGeom& cuGeom, uint32_t absPartIdx, const IntraNeighbors& intraNeighbors, uint32_t chromaId);
 };
 }
x265_1.5.tar.gz/source/common/primitives.cpp -> x265_1.6.tar.gz/source/common/primitives.cpp
Changed
@@ -98,6 +98,7 @@
         p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp;
         p.chroma[X265_CSP_I444].pu[i].addAvg = p.pu[i].addAvg;
         p.chroma[X265_CSP_I444].pu[i].satd = p.pu[i].satd;
+        p.chroma[X265_CSP_I444].pu[i].chroma_p2s = p.pu[i].filter_p2s;
     }
     for (int i = 0; i < NUM_CU_SIZES; i++)
@@ -190,7 +191,6 @@
 /* cpuid >= 0 - force CPU type
  * cpuid < 0 - auto-detect if uninitialized */
-extern "C"
 void x265_setup_primitives(x265_param *param, int cpuid)
 {
     if (cpuid < 0)
@@ -257,7 +257,7 @@
 extern "C" {
 int x265_cpu_cpuid_test(void) { return 0; }
 void x265_cpu_emms(void) {}
-void x265_cpu_cpuid(uint32_t, uint32_t *, uint32_t *, uint32_t *, uint32_t *) {}
+void x265_cpu_cpuid(uint32_t, uint32_t *eax, uint32_t *, uint32_t *, uint32_t *) { *eax = 0; }
 void x265_cpu_xgetbv(uint32_t, uint32_t *, uint32_t *) {}
 }
 #endif
x265_1.5.tar.gz/source/common/primitives.h -> x265_1.6.tar.gz/source/common/primitives.h
Changed
@@ -119,6 +119,7 @@
 typedef void (*intra_pred_t)(pixel* dst, intptr_t dstStride, const pixel *srcPix, int dirMode, int bFilter);
 typedef void (*intra_allangs_t)(pixel *dst, pixel *refPix, pixel *filtPix, int bLuma);
+typedef void (*intra_filter_t)(const pixel* references, pixel* filtered);
 typedef void (*cpy2Dto1D_shl_t)(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
 typedef void (*cpy2Dto1D_shr_t)(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
@@ -136,8 +137,7 @@
 typedef uint32_t (*nquant_t)(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff);
 typedef void (*dequant_scaling_t)(const int16_t* src, const int32_t* dequantCoef, int16_t* dst, int num, int mcqp_miper, int shift);
 typedef void (*dequant_normal_t)(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
-typedef int (*count_nonzero_t)(const int16_t* quantCoeff, int numCoeff);
-
+typedef int(*count_nonzero_t)(const int16_t* quantCoeff);
 typedef void (*weightp_pp_t)(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
 typedef void (*weightp_sp_t)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
 typedef void (*scale_t)(pixel* dst, const pixel* src, intptr_t stride);
@@ -155,7 +155,8 @@
 typedef void (*filter_sp_t) (const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_ss_t) (const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_hv_pp_t) (const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+typedef void (*filter_p2s_wxh_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst);
 typedef void (*copy_pp_t)(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); // dst is aligned
 typedef void (*copy_sp_t)(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
@@ -178,6 +179,8 @@
 typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
+typedef int (*findPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+
 /* Function pointers to optimized encoder primitives. Each pointer can reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
 struct EncoderPrimitives
@@ -207,6 +210,7 @@
         addAvg_t addAvg; // bidir motion compensation, uses 16bit values
         copy_pp_t copy_pp;
+        filter_p2s_t filter_p2s;
     }
     pu[NUM_PU_SIZES];
@@ -225,7 +229,7 @@
         pixel_add_ps_t add_ps;
         blockfill_s_t blockfill_s; // block fill, for DC transforms
         copy_cnt_t copy_cnt; // copy coeff while counting non-zero
-
+        count_nonzero_t count_nonzero;
         cpy2Dto1D_shl_t cpy2Dto1D_shl;
         cpy2Dto1D_shr_t cpy2Dto1D_shr;
         cpy1Dto2D_shl_t cpy1Dto2D_shl;
@@ -246,6 +250,7 @@
         transpose_t transpose; // transpose pixel block; for use with intra all-angs
         intra_allangs_t intra_pred_allangs;
+        intra_filter_t intra_filter;
         intra_pred_t intra_pred[NUM_INTRA_MODE];
     }
     cu[NUM_CU_SIZES];
@@ -260,9 +265,7 @@
     nquant_t nquant;
     dequant_scaling_t dequant_scaling;
     dequant_normal_t dequant_normal;
-    count_nonzero_t count_nonzero;
     denoiseDct_t denoiseDct;
-
     scale_t scale1D_128to64;
     scale_t scale2D_64to32;
@@ -286,7 +289,9 @@
     weightp_sp_t weight_sp;
     weightp_pp_t weight_pp;
-    filter_p2s_t luma_p2s;
+    filter_p2s_wxh_t luma_p2s;
+
+    findPosLast_t findPosLast;
     /* There is one set of chroma primitives per color space. An encoder will
      * have just a single color space and thus it will only ever use one entry
@@ -311,6 +316,8 @@
             filter_hps_t filter_hps;
             addAvg_t addAvg;
             copy_pp_t copy_pp;
+            filter_p2s_t chroma_p2s;
+
         }
         pu[NUM_PU_SIZES];
@@ -329,7 +336,7 @@
         }
         cu[NUM_CU_SIZES];
-        filter_p2s_t p2s; // takes width/height as arguments
+        filter_p2s_wxh_t p2s; // takes width/height as arguments
     }
     chroma[X265_CSP_COUNT];
 };
x265_1.5.tar.gz/source/common/quant.cpp -> x265_1.6.tar.gz/source/common/quant.cpp
Changed
@@ -50,7 +50,7 @@
     return y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
 }
-inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, uint32_t absGoRice, uint32_t c1c2Idx)
+inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, const uint32_t absGoRice, const uint32_t maxVlc, uint32_t c1c2Idx)
 {
     X265_CHECK(c1c2Idx <= 3, "c1c2Idx check failure\n");
     X265_CHECK(absGoRice <= 4, "absGoRice check failure\n");
@@ -72,7 +72,6 @@
     else
     {
         uint32_t symbol = diffLevel;
-        const uint32_t maxVlc = g_goRiceRange[absGoRice];
         bool expGolomb = (symbol > maxVlc);
         if (expGolomb)
@@ -105,6 +104,41 @@
     return rate;
 }
+#if CHECKED_BUILD || _DEBUG
+inline int getICRateNegDiff(uint32_t absLevel, const int* greaterOneBits, const int* levelAbsBits)
+{
+    X265_CHECK(absLevel <= 2, "absLevel check failure\n");
+
+    int rate;
+    if (absLevel == 0)
+        rate = 0;
+    else if (absLevel == 2)
+        rate = greaterOneBits[1] + levelAbsBits[0];
+    else
+        rate = greaterOneBits[0];
+    return rate;
+}
+#endif
+
+inline int getICRateLessVlc(uint32_t absLevel, int32_t diffLevel, const uint32_t absGoRice)
+{
+    X265_CHECK(absGoRice <= 4, "absGoRice check failure\n");
+    if (!absLevel)
+    {
+        X265_CHECK(diffLevel < 0, "diffLevel check failure\n");
+        return 0;
+    }
+    int rate;
+
+    uint32_t symbol = diffLevel;
+    uint32_t prefLen = (symbol >> absGoRice) + 1;
+    uint32_t numBins = fastMin(prefLen + absGoRice, 8 /* g_goRicePrefixLen[absGoRice] + absGoRice */);
+
+    rate = numBins << 15;
+
+    return rate;
+}
+
 /* Calculates the cost for specific absolute transform level */
 inline uint32_t getICRateCost(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, uint32_t absGoRice, uint32_t c1c2Idx)
 {
@@ -160,12 +194,12 @@
     m_nr = NULL;
 }
-bool Quant::init(bool useRDOQ, double psyScale, const ScalingList& scalingList, Entropy& entropy)
+bool Quant::init(int rdoqLevel, double psyScale, const ScalingList& scalingList, Entropy& entropy)
 {
     m_entropyCoder = &entropy;
-    m_useRDOQ = useRDOQ;
+    m_rdoqLevel = rdoqLevel;
     m_psyRdoqScale = (int64_t)(psyScale * 256.0);
-    m_scalingList = &scalingList;
+    m_scalingList = &scalingList;
     m_resiDctCoeff = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE * 2);
     m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE);
     m_fencShortBuf = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE);
@@ -382,13 +416,13 @@
         }
     }
-    if (m_useRDOQ)
+    if (m_rdoqLevel)
         return rdoQuant(cu, coeff, log2TrSize, ttype, absPartIdx, usePsy);
     else
     {
         int deltaU[32 * 32];
-        int scalingListType = ttype + (isLuma ? 3 : 0);
+        int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype;
         int rem = m_qpParam[ttype].rem;
         int per = m_qpParam[ttype].per;
         const int32_t* quantCoeff = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem];
@@ -454,9 +488,7 @@
     else
     {
         int useDST = !sizeIdx && ttype == TEXT_LUMA && bIntra;
-
-        X265_CHECK((int)numSig == primitives.count_nonzero(coeff, 1 << (log2TrSize * 2)), "numSig differ\n");
-
+        X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(coeff), "numSig differ\n");
         // DC only
         if (numSig == 1 && coeff[0] != 0 && !useDST)
         {
@@ -493,13 +525,10 @@
     const int32_t* qCoef = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem];
     int numCoeff = 1 << (log2TrSize * 2);
-
     uint32_t numSig = primitives.nquant(m_resiDctCoeff, qCoef, dstCoeff, qbits, add, numCoeff);
-
-    X265_CHECK((int)numSig == primitives.count_nonzero(dstCoeff, 1 << (log2TrSize * 2)), "numSig differ\n");
+    X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n");
     if (!numSig)
         return 0;
-
     uint32_t trSize = 1 << log2TrSize;
     int64_t lambda2 = m_qpParam[ttype].lambda2;
     int64_t psyScale = (m_psyRdoqScale * m_qpParam[ttype].lambda);
@@ -674,9 +703,43 @@
                 /* record costs for sign-hiding performed at the end */
                 if (level)
                 {
-                    int rateNow = getICRate(level, level - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx);
-                    rateIncUp[blkPos] = getICRate(level + 1, level + 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) - rateNow;
-                    rateIncDown[blkPos] = getICRate(level - 1, level - 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) - rateNow;
+                    const int32_t diff0 = level - 1 - baseLevel;
+                    const int32_t diff2 = level + 1 - baseLevel;
+                    const int32_t maxVlc = g_goRiceRange[goRiceParam];
+                    int rate0, rate1, rate2;
+
+                    if (diff0 < -2) // prob (92.9, 86.5, 74.5)%
+                    {
+                        // NOTE: Min: L - 1 - {1,2,1,3} < -2 ==> L < {0,1,0,2}
+                        // additional L > 0, so I got (L > 0 && L < 2) ==> L = 1
+                        X265_CHECK(level == 1, "absLevel check failure\n");
+
+                        const int rateEqual2 = greaterOneBits[1] + levelAbsBits[0];;
+                        const int rateNotEqual2 = greaterOneBits[0];
+
+                        rate0 = 0;
+                        rate2 = rateEqual2;
+                        rate1 = rateNotEqual2;
+
+                        X265_CHECK(rate1 == getICRateNegDiff(level + 0, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                        X265_CHECK(rate2 == getICRateNegDiff(level + 1, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                        X265_CHECK(rate0 == getICRateNegDiff(level - 1, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                    }
+                    else if (diff0 >= 0 && diff2 <= maxVlc) // prob except from above path (98.6, 97.9, 96.9)%
+                    {
+                        // NOTE: no c1c2 correct rate since all of rate include this factor
+                        rate1 = getICRateLessVlc(level + 0, diff0 + 1, goRiceParam);
+                        rate2 = getICRateLessVlc(level + 1, diff0 + 2, goRiceParam);
+                        rate0 = getICRateLessVlc(level - 1, diff0 + 0, goRiceParam);
+                    }
+                    else
+                    {
+                        rate1 = getICRate(level + 0, diff0 + 1, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                        rate2 = getICRate(level + 1, diff0 + 2, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                        rate0 = getICRate(level - 1, diff0 + 0, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                    }
+                    rateIncUp[blkPos] = rate2 - rate1;
+                    rateIncDown[blkPos] = rate0 - rate1;
                 }
                 else
                 {
@@ -762,7 +825,7
@@ costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][1]); totalRdCost += costCoeffGroupSig[cgScanPos]; /* add the cost of 1 bit in significant CG bitmap */ - if (costZeroCG < totalRdCost) + if (costZeroCG < totalRdCost && m_rdoqLevel > 1) { sigCoeffGroupFlag64 &= ~cgBlkPosMask; totalRdCost = costZeroCG; @@ -870,7 +933,7 @@ bestLastIdx = scanPos + 1; bestCost = costAsLast; } - if (dstCoeff[blkPos] > 1) + if (dstCoeff[blkPos] > 1 || m_rdoqLevel == 1) { foundLast = true; break; @@ -1037,7 +1100,8 @@ const uint32_t trSizeCG = 1 << log2TrSizeCG; X265_CHECK(trSizeCG <= 8, "transform CG is too large\n"); - const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (1 + (cgPosY << log2TrSizeCG) + cgPosX)); + const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1; + const uint32_t sigPos = (uint32_t)(shift >= 64 ? 0 : sigCoeffGroupFlag64 >> shift); const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1); const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
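Note on the last quant.cpp hunk: the new `shift >= 64 ? 0 : ...` guard matters because right-shifting a 64-bit value by 64 or more bits is undefined behaviour in C++. The sketch below isolates that guard. The helper names `shiftRightSafe` and `sigPosFor` are hypothetical and not part of x265; only the shift expression is taken from the hunk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of x265): a right shift that yields 0 once the
// shift count reaches the operand width, instead of invoking undefined behaviour.
inline uint64_t shiftRightSafe(uint64_t value, uint32_t shift)
{
    return shift >= 64 ? 0 : value >> shift;
}

// Mirrors the sigPos computation from the hunk. Bit 0 of the result is the
// flag of the coefficient group immediately after (cgPosX, cgPosY) in raster order.
inline uint32_t sigPosFor(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX,
                          uint32_t cgPosY, uint32_t log2TrSizeCG)
{
    assert(log2TrSizeCG <= 3); // trSizeCG <= 8, as the original X265_CHECK requires
    const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1;
    return (uint32_t)shiftRightSafe(sigCoeffGroupFlag64, shift);
}
```

With an 8x8 grid of coefficient groups, the bottom-right group (7, 7) gives a shift of exactly 64, which is the case the guard exists for.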
x265_1.5.tar.gz/source/common/quant.h -> x265_1.6.tar.gz/source/common/quant.h
Changed
@@ -81,7 +81,7 @@
     QpParam m_qpParam[3];
-    bool m_useRDOQ;
+    int m_rdoqLevel;
     int64_t m_psyRdoqScale;
     int16_t* m_resiDctCoeff;
     int16_t* m_fencDctCoeff;
@@ -99,7 +99,7 @@
     ~Quant();
     /* one-time setup */
-    bool init(bool useRDOQ, double psyScale, const ScalingList& scalingList, Entropy& entropy);
+    bool init(int rdoqLevel, double psyScale, const ScalingList& scalingList, Entropy& entropy);
     bool allocNoiseReduction(const x265_param& param);
     /* CU setup */
x265_1.5.tar.gz/source/common/scalinglist.cpp -> x265_1.6.tar.gz/source/common/scalinglist.cpp
Changed
@@ -222,7 +222,7 @@
 void ScalingList::processDefaultMarix(int sizeId, int listId)
 {
-    ::memcpy(m_scalingListCoef[sizeId][listId], getScalingListDefaultAddress(sizeId, listId), sizeof(int) * X265_MIN(MAX_MATRIX_COEF_NUM, s_numCoefPerSize[sizeId]));
+    memcpy(m_scalingListCoef[sizeId][listId], getScalingListDefaultAddress(sizeId, listId), sizeof(int) * X265_MIN(MAX_MATRIX_COEF_NUM, s_numCoefPerSize[sizeId]));
     m_scalingListDC[sizeId][listId] = SCALING_LIST_DC;
 }
x265_1.5.tar.gz/source/common/shortyuv.cpp -> x265_1.6.tar.gz/source/common/shortyuv.cpp
Changed
@@ -66,9 +66,9 @@
 void ShortYuv::clear()
 {
-    ::memset(m_buf[0], 0, (m_size * m_size) * sizeof(int16_t));
-    ::memset(m_buf[1], 0, (m_csize * m_csize) * sizeof(int16_t));
-    ::memset(m_buf[2], 0, (m_csize * m_csize) * sizeof(int16_t));
+    memset(m_buf[0], 0, (m_size * m_size) * sizeof(int16_t));
+    memset(m_buf[1], 0, (m_csize * m_csize) * sizeof(int16_t));
+    memset(m_buf[2], 0, (m_csize * m_csize) * sizeof(int16_t));
 }
 void ShortYuv::subtract(const Yuv& srcYuv0, const Yuv& srcYuv1, uint32_t log2Size)
x265_1.5.tar.gz/source/common/slice.cpp -> x265_1.6.tar.gz/source/common/slice.cpp
Changed
@@ -33,7 +33,7 @@
 {
     if (m_sliceType == I_SLICE)
     {
-        ::memset(m_refPicList, 0, sizeof(m_refPicList));
+        memset(m_refPicList, 0, sizeof(m_refPicList));
         m_numRefIdx[1] = m_numRefIdx[0] = 0;
         return;
     }
@@ -112,7 +112,7 @@
     if (m_sliceType != B_SLICE)
     {
         m_numRefIdx[1] = 0;
-        ::memset(m_refPicList[1], 0, sizeof(m_refPicList[1]));
+        memset(m_refPicList[1], 0, sizeof(m_refPicList[1]));
     }
     else
     {
@@ -183,8 +183,8 @@
 uint32_t Slice::realEndAddress(uint32_t endCUAddr) const
 {
     // Calculate end address
-    uint32_t internalAddress = (endCUAddr - 1) % NUM_CU_PARTITIONS;
-    uint32_t externalAddress = (endCUAddr - 1) / NUM_CU_PARTITIONS;
+    uint32_t internalAddress = (endCUAddr - 1) % NUM_4x4_PARTITIONS;
+    uint32_t externalAddress = (endCUAddr - 1) / NUM_4x4_PARTITIONS;
     uint32_t xmax = m_sps->picWidthInLumaSamples - (externalAddress % m_sps->numCuInWidth) * g_maxCUSize;
     uint32_t ymax = m_sps->picHeightInLumaSamples - (externalAddress / m_sps->numCuInWidth) * g_maxCUSize;
@@ -192,13 +192,13 @@
         internalAddress--;
     internalAddress++;
-    if (internalAddress == NUM_CU_PARTITIONS)
+    if (internalAddress == NUM_4x4_PARTITIONS)
     {
         internalAddress = 0;
         externalAddress++;
     }
-    return externalAddress * NUM_CU_PARTITIONS + internalAddress;
+    return externalAddress * NUM_4x4_PARTITIONS + internalAddress;
 }
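Note on realEndAddress(): the hunks above only rename the constant, but the split/join arithmetic is easier to follow in isolation. The sketch below is a minimal illustration. It assumes 256 partitions per CTU (the number of 4x4 blocks in a 64x64 CTU), whereas x265 derives NUM_4x4_PARTITIONS from its configuration. `kPartsPerCtu`, `SplitAddress`, `splitAddress` and `joinAddress` are hypothetical names, not x265 symbols.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration only: 4x4 blocks in a 64x64 CTU.
static const uint32_t kPartsPerCtu = 256;

struct SplitAddress
{
    uint32_t external; // CTU index in raster order
    uint32_t internal; // 4x4 partition index inside that CTU
};

// Split a flat address the way realEndAddress() does with % and /.
inline SplitAddress splitAddress(uint32_t addr)
{
    SplitAddress s = { addr / kPartsPerCtu, addr % kPartsPerCtu };
    return s;
}

// Recombine, carrying into the next CTU when the internal index has
// stepped one past the last partition (the wrap handled in the last hunk).
inline uint32_t joinAddress(SplitAddress s)
{
    assert(s.internal <= kPartsPerCtu);
    if (s.internal == kPartsPerCtu)
    {
        s.internal = 0;
        s.external++;
    }
    return s.external * kPartsPerCtu + s.internal;
}
```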
x265_1.5.tar.gz/source/common/slice.h -> x265_1.6.tar.gz/source/common/slice.h
Changed
@@ -55,9 +55,9 @@
         , numberOfNegativePictures(0)
         , numberOfPositivePictures(0)
     {
-        ::memset(deltaPOC, 0, sizeof(deltaPOC));
-        ::memset(poc, 0, sizeof(poc));
-        ::memset(bUsed, 0, sizeof(bUsed));
+        memset(deltaPOC, 0, sizeof(deltaPOC));
+        memset(poc, 0, sizeof(poc));
+        memset(bUsed, 0, sizeof(bUsed));
     }
     void sortDeltaPOC();
@@ -149,8 +149,10 @@
 struct VPS
 {
+    uint32_t maxTempSubLayers;
     uint32_t numReorderPics;
     uint32_t maxDecPicBuffering;
+    uint32_t maxLatencyIncrease;
     HRDInfo hrdParameters;
     ProfileTierLevel ptl;
 };
@@ -228,9 +230,10 @@
     bool bUseAMP; // use param
     uint32_t maxAMPDepth;
+    uint32_t maxTempSubLayers; // max number of Temporal Sub layers
     uint32_t maxDecPicBuffering; // these are dups of VPS values
+    uint32_t maxLatencyIncrease;
     int numReorderPics;
-    int maxLatencyIncrease;
     bool bUseStrongIntraSmoothing; // use param
     bool bTemporalMVPEnabled;
@@ -285,6 +288,14 @@
     }
 };
+#define SET_WEIGHT(w, b, s, d, o) \
+    { \
+        (w).inputWeight = (s); \
+        (w).log2WeightDenom = (d); \
+        (w).inputOffset = (o); \
+        (w).bPresentFlag = (b); \
+    }
+
 class Slice
 {
 public:
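Note on SET_WEIGHT: the macro's argument order (present flag, scale, denominator, offset) differs from the order in which it assigns the fields, which is easy to get wrong at call sites. The sketch below shows one way to wrap it. `WeightParam` here is a hypothetical stand-in whose four field names are taken from the macro body, and `makeWeight` is not an x265 function.

```cpp
#include <cassert>

// Hypothetical stand-in for the struct the macro operates on; only the four
// field names come from the hunk.
struct WeightParam
{
    int  inputWeight;
    int  log2WeightDenom;
    int  inputOffset;
    bool bPresentFlag;
};

// Copied from the slice.h hunk.
#define SET_WEIGHT(w, b, s, d, o) \
    { \
        (w).inputWeight = (s); \
        (w).log2WeightDenom = (d); \
        (w).inputOffset = (o); \
        (w).bPresentFlag = (b); \
    }

// Named parameters make the (flag, scale, denominator, offset) order explicit.
inline WeightParam makeWeight(bool present, int scale, int log2Denom, int offset)
{
    assert(log2Denom >= 0); // a negative shift amount would be meaningless
    WeightParam w;
    SET_WEIGHT(w, present, scale, log2Denom, offset);
    return w;
}
```

Because the macro expands to a bare braced block, `if (c) SET_WEIGHT(...); else ...` does not compile; the call needs its own braces in that position.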
x265_1.5.tar.gz/source/common/threading.cpp -> x265_1.6.tar.gz/source/common/threading.cpp
Changed
@@ -26,6 +26,13 @@
 namespace x265 {
 // x265 private namespace
+#if X265_ARCH_X86 && !defined(X86_64) && ENABLE_ASSEMBLY && defined(__GNUC__)
+extern "C" intptr_t x265_stack_align(void (*func)(), ...);
+#define x265_stack_align(func, ...) x265_stack_align((void (*)())func, __VA_ARGS__)
+#else
+#define x265_stack_align(func, ...) func(__VA_ARGS__)
+#endif
+
 /* C shim for forced stack alignment */
 static void stackAlignMain(Thread *instance)
 {
x265_1.5.tar.gz/source/common/threading.h -> x265_1.6.tar.gz/source/common/threading.h
Changed
@@ -42,32 +42,32 @@ #include <sys/sysctl.h> #endif -#ifdef __GNUC__ /* GCCs builtin atomics */ +#ifdef __GNUC__ /* GCCs builtin atomics */ #include <sys/time.h> #include <unistd.h> -#define CLZ(id, x) id = (unsigned long)__builtin_clz(x) ^ 31 -#define CTZ(id, x) id = (unsigned long)__builtin_ctz(x) -#define ATOMIC_OR(ptr, mask) __sync_fetch_and_or(ptr, mask) -#define ATOMIC_AND(ptr, mask) __sync_fetch_and_and(ptr, mask) -#define ATOMIC_INC(ptr) __sync_add_and_fetch((volatile int32_t*)ptr, 1) -#define ATOMIC_DEC(ptr) __sync_add_and_fetch((volatile int32_t*)ptr, -1) -#define ATOMIC_ADD(ptr, value) __sync_add_and_fetch((volatile int32_t*)ptr, value) -#define GIVE_UP_TIME() usleep(0) +#define CLZ(id, x) id = (unsigned long)__builtin_clz(x) ^ 31 +#define CTZ(id, x) id = (unsigned long)__builtin_ctz(x) +#define ATOMIC_OR(ptr, mask) __sync_fetch_and_or(ptr, mask) +#define ATOMIC_AND(ptr, mask) __sync_fetch_and_and(ptr, mask) +#define ATOMIC_INC(ptr) __sync_add_and_fetch((volatile int32_t*)ptr, 1) +#define ATOMIC_DEC(ptr) __sync_add_and_fetch((volatile int32_t*)ptr, -1) +#define ATOMIC_ADD(ptr, val) __sync_fetch_and_add((volatile int32_t*)ptr, val) +#define GIVE_UP_TIME() usleep(0) -#elif defined(_MSC_VER) /* Windows atomic intrinsics */ +#elif defined(_MSC_VER) /* Windows atomic intrinsics */ #include <intrin.h> -#define CLZ(id, x) _BitScanReverse(&id, x) -#define CTZ(id, x) _BitScanForward(&id, x) -#define ATOMIC_INC(ptr) InterlockedIncrement((volatile LONG*)ptr) -#define ATOMIC_DEC(ptr) InterlockedDecrement((volatile LONG*)ptr) -#define ATOMIC_ADD(ptr, value) InterlockedExchangeAdd((volatile LONG*)ptr, value) -#define ATOMIC_OR(ptr, mask) _InterlockedOr((volatile LONG*)ptr, (LONG)mask) -#define ATOMIC_AND(ptr, mask) _InterlockedAnd((volatile LONG*)ptr, (LONG)mask) -#define GIVE_UP_TIME() Sleep(0) +#define CLZ(id, x) _BitScanReverse(&id, x) +#define CTZ(id, x) _BitScanForward(&id, x) +#define ATOMIC_INC(ptr) InterlockedIncrement((volatile LONG*)ptr) +#define 
ATOMIC_DEC(ptr) InterlockedDecrement((volatile LONG*)ptr) +#define ATOMIC_ADD(ptr, val) InterlockedExchangeAdd((volatile LONG*)ptr, val) +#define ATOMIC_OR(ptr, mask) _InterlockedOr((volatile LONG*)ptr, (LONG)mask) +#define ATOMIC_AND(ptr, mask) _InterlockedAnd((volatile LONG*)ptr, (LONG)mask) +#define GIVE_UP_TIME() Sleep(0) #endif // ifdef __GNUC__ @@ -128,8 +128,8 @@ bool timedWait(uint32_t milliseconds) { - /* returns true if event was signaled */ - return WaitForSingleObject(this->handle, milliseconds) == WAIT_OBJECT_0; + /* returns true if the wait timed out */ + return WaitForSingleObject(this->handle, milliseconds) == WAIT_TIMEOUT; } void trigger() @@ -263,10 +263,8 @@ /* blocking wait on conditional variable, mutex is atomically released * while blocked. When condition is signaled, mutex is re-acquired */ - while (m_counter == 0) - { + while (!m_counter) pthread_cond_wait(&m_cond, &m_mutex); - } m_counter--; pthread_mutex_unlock(&m_mutex); @@ -277,7 +275,7 @@ bool bTimedOut = false; pthread_mutex_lock(&m_mutex); - if (m_counter == 0) + if (!m_counter) { struct timeval tv; struct timespec ts; @@ -297,7 +295,10 @@ bTimedOut = pthread_cond_timedwait(&m_cond, &m_mutex, &ts) == ETIMEDOUT; } if (m_counter > 0) + { m_counter--; + bTimedOut = false; + } pthread_mutex_unlock(&m_mutex); return bTimedOut; } @@ -408,6 +409,23 @@ Lock &inst; }; +// Utility class which adds elapsed time of the scope of the object into the +// accumulator provided to the constructor +struct ScopedElapsedTime +{ + ScopedElapsedTime(int64_t& accum) : accumlatedTime(accum) { startTime = x265_mdate(); } + + ~ScopedElapsedTime() { accumlatedTime += x265_mdate() - startTime; } + +protected: + + int64_t startTime; + int64_t& accumlatedTime; + + // do not allow assignments + ScopedElapsedTime &operator =(const ScopedElapsedTime &); +}; + //< Simplistic portable thread class. Shutdown signalling left to derived class class Thread {
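Note on ScopedElapsedTime: the struct added in the last hunk is a scope timer whose destructor adds the object's lifetime to a caller-owned accumulator. The sketch below shows the same pattern in standalone form. It assumes `std::chrono::steady_clock` as a stand-in for `x265_mdate()`, and `nowMicros`, `ScopedTimer` and `timedSleepMicros` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Stand-in for x265_mdate() (an assumption): a monotonic time in microseconds.
inline int64_t nowMicros()
{
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

// Same shape as ScopedElapsedTime: the constructor records the start time and
// the destructor adds the elapsed time to the accumulator it was given.
struct ScopedTimer
{
    explicit ScopedTimer(int64_t& accum) : m_start(nowMicros()), m_accum(accum) {}
    ~ScopedTimer() { m_accum += nowMicros() - m_start; }

    // Copying would double-count the interval, so forbid it.
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    int64_t  m_start;
    int64_t& m_accum;
};

// Hypothetical usage: time a sleep of the given length and return the
// accumulated microseconds.
inline int64_t timedSleepMicros(int milliseconds)
{
    assert(milliseconds >= 0);
    int64_t total = 0;
    {
        ScopedTimer timer(total);
        std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
    } // timer is destroyed here and updates total
    return total;
}
```

The accumulator must outlive the timer, since the destructor writes through the stored reference.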
x265_1.5.tar.gz/source/common/threadpool.cpp -> x265_1.6.tar.gz/source/common/threadpool.cpp
Changed
@@ -27,115 +27,65 @@ #include <new> -#if MACOS -#include <sys/param.h> -#include <sys/sysctl.h> -#endif - -namespace x265 { -// x265 private namespace - -class ThreadPoolImpl; +#if X86_64 -class PoolThread : public Thread -{ -private: +#ifdef __GNUC__ - ThreadPoolImpl &m_pool; +#define SLEEPBITMAP_CTZ(id, x) id = (unsigned long)__builtin_ctzll(x) +#define SLEEPBITMAP_OR(ptr, mask) __sync_fetch_and_or(ptr, mask) +#define SLEEPBITMAP_AND(ptr, mask) __sync_fetch_and_and(ptr, mask) - PoolThread& operator =(const PoolThread&); +#elif defined(_MSC_VER) - int m_id; +#define SLEEPBITMAP_CTZ(id, x) _BitScanForward64(&id, x) +#define SLEEPBITMAP_OR(ptr, mask) InterlockedOr64((volatile LONG64*)ptr, (LONG)mask) +#define SLEEPBITMAP_AND(ptr, mask) InterlockedAnd64((volatile LONG64*)ptr, (LONG)mask) - bool m_dirty; +#endif // ifdef __GNUC__ - bool m_exited; - - Event m_wakeEvent; - -public: - - PoolThread(ThreadPoolImpl& pool, int id) - : m_pool(pool) - , m_id(id) - , m_dirty(false) - , m_exited(false) - { - } - - bool isDirty() const { return m_dirty; } - - void markDirty() { m_dirty = true; } +#else - bool isExited() const { return m_exited; } +/* use 32-bit primitives defined in threading.h */ +#define SLEEPBITMAP_CTZ CTZ +#define SLEEPBITMAP_OR ATOMIC_OR +#define SLEEPBITMAP_AND ATOMIC_AND - void poke() { m_wakeEvent.trigger(); } +#endif - virtual ~PoolThread() {} +#if MACOS +#include <sys/param.h> +#include <sys/sysctl.h> +#endif +#if HAVE_LIBNUMA +#include <numa.h> +#endif - void threadMain(); -}; +namespace x265 { +// x265 private namespace -class ThreadPoolImpl : public ThreadPool +class WorkerThread : public Thread { private: - bool m_ok; - int m_referenceCount; - int m_numThreads; - int m_numSleepMapWords; - PoolThread *m_threads; - volatile uint32_t *m_sleepMap; + ThreadPool& m_pool; + int m_id; + Event m_wakeEvent; - /* Lock for write access to the provider lists. Threads are - * always allowed to read m_firstProvider and follow the - * linked list. 
Providers must zero their m_nextProvider - * pointers before removing themselves from this list */ - Lock m_writeLock; + WorkerThread& operator =(const WorkerThread&); public: - static ThreadPoolImpl *s_instance; - static Lock s_createLock; - - JobProvider *m_firstProvider; - JobProvider *m_lastProvider; - -public: - - ThreadPoolImpl(int numthreads); - - virtual ~ThreadPoolImpl(); - - ThreadPoolImpl *AddReference() - { - m_referenceCount++; - - return this; - } - - void markThreadAsleep(int id); - - void waitForAllIdle(); - - int getThreadCount() const { return m_numThreads; } - - bool IsValid() const { return m_ok; } - - void release(); + JobProvider* m_curJobProvider; + BondedTaskGroup* m_bondMaster; - void Stop(); + WorkerThread(ThreadPool& pool, int id) : m_pool(pool), m_id(id) {} + virtual ~WorkerThread() {} - void enqueueJobProvider(JobProvider &); - - void dequeueJobProvider(JobProvider &); - - void FlushProviderList(); - - void pokeIdleThread(); + void threadMain(); + void awaken() { m_wakeEvent.trigger(); } }; -void PoolThread::threadMain() +void WorkerThread::threadMain() { THREAD_NAME("Worker", m_id); @@ -145,286 +95,361 @@ __attribute__((unused)) int val = nice(10); #endif - while (m_pool.IsValid()) + m_pool.setCurrentThreadAffinity(); + + sleepbitmap_t idBit = (sleepbitmap_t)1 << m_id; + m_curJobProvider = m_pool.m_jpTable[0]; + m_bondMaster = NULL; + + SLEEPBITMAP_OR(&m_curJobProvider->m_ownerBitmap, idBit); + SLEEPBITMAP_OR(&m_pool.m_sleepBitmap, idBit); + m_wakeEvent.wait(); + + while (m_pool.m_isActive) { - /* Walk list of job providers, looking for work */ - JobProvider *cur = m_pool.m_firstProvider; - while (cur) + if (m_bondMaster) { - // FindJob() may perform actual work and return true. 
If - // it does we restart the job search - if (cur->findJob(m_id) == true) - break; - - cur = cur->m_nextProvider; + m_bondMaster->processTasks(m_id); + m_bondMaster->m_exitedPeerCount.incr(); + m_bondMaster = NULL; } - // this thread has reached the end of the provider list - m_dirty = false; - - if (cur == NULL) + do { - m_pool.markThreadAsleep(m_id); - m_wakeEvent.wait(); + /* do pending work for current job provider */ + m_curJobProvider->findJob(m_id); + + /* if the current job provider still wants help, only switch to a + * higher priority provider (lower slice type). Else take the first + * available job provider with the highest priority */
x265_1.5.tar.gz/source/common/threadpool.h -> x265_1.6.tar.gz/source/common/threadpool.h
Changed
@@ -25,85 +25,148 @@ #define X265_THREADPOOL_H #include "common.h" +#include "threading.h" namespace x265 { // x265 private namespace class ThreadPool; +class WorkerThread; +class BondedTaskGroup; -int getCpuCount(); +#if X86_64 +typedef uint64_t sleepbitmap_t; +#else +typedef uint32_t sleepbitmap_t; +#endif -// Any class that wants to distribute work to the thread pool must -// derive from JobProvider and implement FindJob(). +static const sleepbitmap_t ALL_POOL_THREADS = (sleepbitmap_t)-1; +enum { MAX_POOL_THREADS = sizeof(sleepbitmap_t) * 8 }; +enum { INVALID_SLICE_PRIORITY = 10 }; // a value larger than any X265_TYPE_* macro + +// Frame level job providers. FrameEncoder and Lookahead derive from +// this class and implement findJob() class JobProvider { -protected: - - ThreadPool *m_pool; - - JobProvider *m_nextProvider; - JobProvider *m_prevProvider; - public: - JobProvider(ThreadPool *p) : m_pool(p), m_nextProvider(0), m_prevProvider(0) {} + ThreadPool* m_pool; + sleepbitmap_t m_ownerBitmap; + int m_jpId; + int m_sliceType; + bool m_helpWanted; + bool m_isFrameEncoder; /* rather ugly hack, but nothing better presents itself */ + + JobProvider() + : m_pool(NULL) + , m_ownerBitmap(0) + , m_jpId(-1) + , m_sliceType(INVALID_SLICE_PRIORITY) + , m_helpWanted(false) + , m_isFrameEncoder(false) + {} virtual ~JobProvider() {} - void setThreadPool(ThreadPool *p) { m_pool = p; } - - // Register this job provider with the thread pool, jobs are available - void enqueue(); - - // Remove this job provider from the thread pool, all jobs complete - void dequeue(); - - // Worker threads will call this method to find a job. Must return true if - // work was completed. False if no work was available. 
- virtual bool findJob(int threadId) = 0; - - // All derived objects that call Enqueue *MUST* call flush before allowing - // their object to be destroyed, otherwise you will see random crashes involving - // partially freed vtables and you will be unhappy - void flush(); + // Worker threads will call this method to perform work + virtual void findJob(int workerThreadId) = 0; - friend class ThreadPoolImpl; - friend class PoolThread; + // Will awaken one idle thread, preferring a thread which most recently + // performed work for this provider. + void tryWakeOne(); }; -// Abstract interface to ThreadPool. Each encoder instance should call -// AllocThreadPool() to get a handle to the singleton object and then make -// it available to their job provider structures (wave-front frame encoders, -// etc). class ThreadPool { -protected: - - // Destructor is inaccessable, force the use of reference counted Release() - ~ThreadPool() {} - - virtual void enqueueJobProvider(JobProvider &) = 0; +public: - virtual void dequeueJobProvider(JobProvider &) = 0; + sleepbitmap_t m_sleepBitmap; + int m_numProviders; + int m_numWorkers; + int m_numaNode; + bool m_isActive; -public: + JobProvider** m_jpTable; + WorkerThread* m_workers; - // When numthreads == 0, a default thread count is used. A request may grow - // an existing pool but it will never shrink. 
- static ThreadPool *allocThreadPool(int numthreads = 0); + ThreadPool(); + ~ThreadPool(); - static ThreadPool *getThreadPool(); + bool create(int numThreads, int maxProviders, int node); + bool start(); + void stop(); + void setCurrentThreadAffinity(); + int tryAcquireSleepingThread(sleepbitmap_t firstTryBitmap, sleepbitmap_t secondTryBitmap); + int tryBondPeers(int maxPeers, sleepbitmap_t peerBitmap, BondedTaskGroup& master); - virtual void pokeIdleThread() = 0; + static ThreadPool* allocThreadPools(x265_param* p, int& numPools); - // The pool is reference counted so all calls to AllocThreadPool() should be - // followed by a call to Release() - virtual void release() = 0; + static int getCpuCount(); + static int getNumaNodeCount(); + static void setThreadNodeAffinity(int node); +}; - virtual int getThreadCount() const = 0; +/* Any worker thread may enlist the help of idle worker threads from the same + * job provider. They must derive from this class and implement the + * processTasks() method. To use, an instance must be instantiated by a worker + * thread (referred to as the master thread) and then tryBondPeers() must be + * called. If it returns non-zero then some number of slave worker threads are + * already in the process of calling your processTasks() function. The master + * thread should participate and call processTasks() itself. When + * waitForExit() returns, all bonded peer threads are quarunteed to have + * exitied processTasks(). 
Since the thread count is small, it uses explicit + * locking instead of atomic counters and bitmasks */ +class BondedTaskGroup +{ +public: - friend class JobProvider; + Lock m_lock; + ThreadSafeInteger m_exitedPeerCount; + int m_bondedPeerCount; + int m_jobTotal; + int m_jobAcquired; + + BondedTaskGroup() { m_bondedPeerCount = m_jobTotal = m_jobAcquired = 0; } + + /* Do not allow the instance to be destroyed before all bonded peers have + * exited processTasks() */ + ~BondedTaskGroup() { waitForExit(); } + + /* Try to enlist the help of idle worker threads on most recently associated + * with the given job provider and "bond" them to work on your tasks. Up to + * maxPeers worker threads will call your processTasks() method. */ + int tryBondPeers(JobProvider& jp, int maxPeers) + { + int count = jp.m_pool->tryBondPeers(maxPeers, jp.m_ownerBitmap, *this); + m_bondedPeerCount += count; + return count; + } + + /* Try to enlist the help of any idle worker threads and "bond" them to work + * on your tasks. Up to maxPeers worker threads will call your + * processTasks() method. */ + int tryBondPeers(ThreadPool& pool, int maxPeers) + { + int count = pool.tryBondPeers(maxPeers, ALL_POOL_THREADS, *this); + m_bondedPeerCount += count; + return count; + } + + /* Returns when all bonded peers have exited processTasks(). It does *NOT* + * ensure all tasks are completed (but this is generally implied). */ + void waitForExit() + { + int exited = m_exitedPeerCount.get(); + while (m_bondedPeerCount != exited) + exited = m_exitedPeerCount.waitForChange(exited); + } + + /* Derived classes must define this method. The worker thread ID may be + * used to index into thread local data, or ignored. The ID will be between + * 0 and jp.m_numWorkers - 1 */ + virtual void processTasks(int workerThreadId) = 0; }; + } // end namespace x265 #endif // ifndef X265_THREADPOOL_H
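Note on BondedTaskGroup: the header comment describes a protocol in which a master thread bonds idle peers, every participant calls processTasks(), and waitForExit() returns once all peers have left it. The sketch below illustrates that flow under loud assumptions. It uses `std::thread` in place of the pool's idle workers and `join()` in place of waitForExit(), and `SquareTasks` is a hypothetical class that does not derive from the real BondedTaskGroup.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical task group: squares the numbers 0..jobTotal-1. Jobs are claimed
// under a lock, matching the header's remark that explicit locking is used.
class SquareTasks
{
public:
    explicit SquareTasks(int jobTotal)
        : m_jobTotal(jobTotal), m_jobAcquired(0), m_results(jobTotal, 0)
    {
        assert(jobTotal >= 0);
    }

    // Called by the master and by every bonded peer. Each caller keeps
    // claiming jobs until none remain, then returns.
    void processTasks(int /* workerThreadId */)
    {
        for (;;)
        {
            int job;
            {
                std::lock_guard<std::mutex> guard(m_lock);
                if (m_jobAcquired == m_jobTotal)
                    return;
                job = m_jobAcquired++;
            }
            m_results[job] = job * job; // each slot is written by exactly one thread
        }
    }

    // Stand-in for tryBondPeers() followed by waitForExit().
    void run(int maxPeers)
    {
        std::vector<std::thread> peers;
        for (int i = 0; i < maxPeers; i++)
            peers.emplace_back(&SquareTasks::processTasks, this, i + 1);
        processTasks(0);         // the master participates as well
        for (std::thread& t : peers)
            t.join();            // every peer has now exited processTasks()
    }

    const std::vector<int>& results() const { return m_results; }

private:
    std::mutex       m_lock;
    int              m_jobTotal;
    int              m_jobAcquired;
    std::vector<int> m_results;
};
```

As in the original comment, returning from the wait only guarantees that the peers have exited. Here that also implies completion because a caller leaves processTasks() only when no jobs remain.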
x265_1.5.tar.gz/source/common/wavefront.cpp -> x265_1.6.tar.gz/source/common/wavefront.cpp
Changed
@@ -54,13 +54,13 @@
 void WaveFront::clearEnabledRowMask()
 {
     memset((void*)m_externalDependencyBitmap, 0, sizeof(uint32_t) * m_numWords);
+    memset((void*)m_internalDependencyBitmap, 0, sizeof(uint32_t) * m_numWords);
 }
 void WaveFront::enqueueRow(int row)
 {
     uint32_t bit = 1 << (row & 31);
     ATOMIC_OR(&m_internalDependencyBitmap[row >> 5], bit);
-    if (m_pool) m_pool->pokeIdleThread();
 }
 void WaveFront::enableRow(int row)
@@ -80,11 +80,11 @@
     return !!(ATOMIC_AND(&m_internalDependencyBitmap[row >> 5], ~bit) & bit);
 }
-bool WaveFront::findJob(int threadId)
+void WaveFront::findJob(int threadId)
 {
     unsigned long id;
-    // thread safe
+    /* Loop over each word until all available rows are finished */
     for (int w = 0; w < m_numWords; w++)
     {
         uint32_t oldval = m_internalDependencyBitmap[w] & m_externalDependencyBitmap[w];
@@ -97,15 +97,14 @@
             {
                 /* we cleared the bit, we get to process the row */
                 processRow(w * 32 + id, threadId);
-                return true;
+                m_helpWanted = true;
+                return; /* check for a higher priority task */
             }
-            // some other thread cleared the bit, try another bit
             oldval = m_internalDependencyBitmap[w] & m_externalDependencyBitmap[w];
         }
     }
-    // made it through the bitmap without finding any enqueued rows
-    return false;
+    m_helpWanted = false;
 }
 }
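Note on the new findJob() contract: instead of returning a bool, the function now processes at most one row, records in m_helpWanted whether it found work, and returns so the worker can check for a higher priority provider. The sketch below illustrates that claim-one-row pattern under assumptions. It uses a single 32-bit word and `std::atomic` in place of the ATOMIC_* macros, and `RowQueue` is a hypothetical type, not the x265 class.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-word row queue. A row is runnable only when it is both
// queued (internal bitmap) and enabled (external bitmap).
struct RowQueue
{
    std::atomic<uint32_t> internalBitmap{0}; // rows queued for processing
    std::atomic<uint32_t> externalBitmap{0}; // rows whose dependencies are met
    bool helpWanted = false;
    std::vector<int> processed; // stands in for processRow(); not thread safe

    void enqueueRow(int row) { assert(row >= 0 && row < 32); internalBitmap.fetch_or(1u << row); }
    void enableRow(int row)  { assert(row >= 0 && row < 32); externalBitmap.fetch_or(1u << row); }

    void findJob()
    {
        uint32_t ready = internalBitmap.load() & externalBitmap.load();
        while (ready)
        {
            int id = 0;
            while (!((ready >> id) & 1)) // lowest numbered runnable row
                id++;
            const uint32_t bit = 1u << id;

            // fetch_and returns the previous value: the row is ours only if
            // the bit was still set when we cleared it.
            if (internalBitmap.fetch_and(~bit) & bit)
            {
                processed.push_back(id);
                helpWanted = true; // more rows may remain
                return;
            }
            // another thread claimed that row first, so re-read and retry
            ready = internalBitmap.load() & externalBitmap.load();
        }
        helpWanted = false; // nothing runnable was found
    }
};
```

The bitmap claim is safe across threads, but the `processed` vector is only a single-threaded placeholder for the real row processing.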
x265_1.5.tar.gz/source/common/wavefront.h -> x265_1.6.tar.gz/source/common/wavefront.h
Changed
@@ -53,10 +53,9 @@
 public:
-    WaveFront(ThreadPool *pool)
-        : JobProvider(pool)
-        , m_internalDependencyBitmap(0)
-        , m_externalDependencyBitmap(0)
+    WaveFront()
+        : m_internalDependencyBitmap(NULL)
+        , m_externalDependencyBitmap(NULL)
     {}
     virtual ~WaveFront();
@@ -86,8 +85,8 @@
     // WaveFront's implementation of JobProvider::findJob. Consults
     // m_queuedBitmap and calls ProcessRow(row) for lowest numbered queued row
-    // or returns false
-    bool findJob(int threadId);
+    // processes available rows and returns when no work remains
+    void findJob(int threadId);
     // Start or resume encode processing of this row, must be implemented by
     // derived classes.
x265_1.5.tar.gz/source/common/x86/asm-primitives.cpp -> x265_1.6.tar.gz/source/common/x86/asm-primitives.cpp
Changed
@@ -44,6 +44,11 @@ p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## _64x64_ ## cpu +#define ALL_LUMA_CU_TYPED_S(prim, fncdef, fname, cpu) \ + p.cu[BLOCK_8x8].prim = fncdef x265_ ## fname ## 8_ ## cpu; \ + p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## 16_ ## cpu; \ + p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## 32_ ## cpu; \ + p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## 64_ ## cpu #define ALL_LUMA_TU_TYPED(prim, fncdef, fname, cpu) \ p.cu[BLOCK_4x4].prim = fncdef x265_ ## fname ## _4x4_ ## cpu; \ p.cu[BLOCK_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ @@ -61,6 +66,7 @@ p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## _64x64_ ## cpu; #define ALL_LUMA_CU(prim, fname, cpu) ALL_LUMA_CU_TYPED(prim, , fname, cpu) +#define ALL_LUMA_CU_S(prim, fname, cpu) ALL_LUMA_CU_TYPED_S(prim, , fname, cpu) #define ALL_LUMA_TU(prim, fname, cpu) ALL_LUMA_TU_TYPED(prim, , fname, cpu) #define ALL_LUMA_BLOCKS(prim, fname, cpu) ALL_LUMA_BLOCKS_TYPED(prim, , fname, cpu) #define ALL_LUMA_TU_S(prim, fname, cpu) ALL_LUMA_TU_TYPED_S(prim, , fname, cpu) @@ -179,7 +185,6 @@ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].prim = fncdef x265_ ## fname ## _8x32_ ## cpu #define ALL_CHROMA_420_4x4_PU(prim, fname, cpu) ALL_CHROMA_420_4x4_PU_TYPED(prim, , fname, cpu) - #define ALL_CHROMA_422_CU_TYPED(prim, fncdef, fname, cpu) \ p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].prim = fncdef x265_ ## fname ## _4x8_ ## cpu; \ p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].prim = fncdef x265_ ## fname ## _8x16_ ## cpu; \ @@ -791,6 +796,10 @@ void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // 16bpp { +#if !defined(X86_64) +#error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF" +#endif + if (cpuMask & X265_CPU_SSE2) { /* We do 
not differentiate CPUs which support MMX and not SSE2. We only check @@ -863,6 +872,16 @@ ALL_LUMA_TU_S(calcresidual, getResidual, sse2); ALL_LUMA_TU_S(transpose, transpose, sse2); + p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2; + p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2; + p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2; + p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2; + + p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2; + p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2; + p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2; + p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2; + p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2; ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2); @@ -872,10 +891,10 @@ p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixelcmp_t)x265_pixel_ssd_ss_32x64_sse2; p.cu[BLOCK_4x4].dct = x265_dct4_sse2; + p.cu[BLOCK_8x8].dct = x265_dct8_sse2; p.cu[BLOCK_4x4].idct = x265_idct4_sse2; -#if X86_64 p.cu[BLOCK_8x8].idct = x265_idct8_sse2; -#endif + p.idst4x4 = x265_idst4_sse2; LUMA_VSS_FILTERS(sse2); @@ -894,7 +913,10 @@ p.dst4x4 = x265_dst4_ssse3; p.cu[BLOCK_8x8].idct = x265_idct8_ssse3; - p.count_nonzero = x265_count_nonzero_ssse3; + p.cu[BLOCK_4x4].count_nonzero = x265_count_nonzero_4x4_ssse3; + p.cu[BLOCK_8x8].count_nonzero = x265_count_nonzero_8x8_ssse3; + p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3; + p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3; p.frameInitLowres = x265_frame_init_lowres_core_ssse3; } if (cpuMask & X265_CPU_SSE4) @@ -931,19 +953,30 @@ p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_sse4; p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_sse4; -#if X86_64 + // TODO: check POPCNT flag! 
+ ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4); ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4); ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4); -#endif } if (cpuMask & X265_CPU_AVX) { // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_avx; fails tests + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = x265_pixel_satd_16x24_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = x265_pixel_satd_32x48_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = x265_pixel_satd_24x64_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = x265_pixel_satd_8x64_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = x265_pixel_satd_8x12_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = x265_pixel_satd_12x32_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = x265_pixel_satd_4x32_avx; + ALL_LUMA_PU(satd, pixel_satd, avx); ASSIGN_SA8D(avx); LUMA_VAR(avx); p.ssim_4x4x2_core = x265_pixel_ssim_4x4x2_core_avx; p.ssim_end_4 = x265_pixel_ssim_end4_avx; + + // copy_pp primitives + // 16 x N p.pu[LUMA_64x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x64_avx; p.pu[LUMA_16x4].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x4_avx; p.pu[LUMA_16x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x8_avx; @@ -963,11 +996,82 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x16_avx; p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x24_avx; p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x32_avx; + + // 24 X N + p.pu[LUMA_24x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x32_avx; + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x32_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x64_avx; + + // 32 x N + p.pu[LUMA_32x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x8_avx; + p.pu[LUMA_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx; + p.pu[LUMA_32x24].copy_pp = 
(copy_pp_t)x265_blockcopy_ss_32x24_avx; + p.pu[LUMA_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx; + p.pu[LUMA_32x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x64_avx; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x8_avx; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x24_avx; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x48_avx; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x64_avx; + + // 48 X 64 + p.pu[LUMA_48x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_48x64_avx; + + // copy_ss primitives + // 16 X N + p.cu[BLOCK_16x16].copy_ss = x265_blockcopy_ss_16x16_avx; + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx; + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx; + + // 32 X N + p.cu[BLOCK_32x32].copy_ss = x265_blockcopy_ss_32x32_avx; + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx; + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx; + + // 64 X N + p.cu[BLOCK_64x64].copy_ss = x265_blockcopy_ss_64x64_avx; + + // copy_ps primitives + // 16 X N + p.cu[BLOCK_16x16].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x16_avx; + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x16_avx; + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x32_avx; + + // 32 X N + p.cu[BLOCK_32x32].copy_ps = 
(copy_ps_t)x265_blockcopy_ss_32x32_avx; + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x32_avx; + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x64_avx; + + // 64 X N + p.cu[BLOCK_64x64].copy_ps = (copy_ps_t)x265_blockcopy_ss_64x64_avx; + + // copy_sp primitives + // 16 X N + p.cu[BLOCK_16x16].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x16_avx; + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x16_avx; + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x32_avx; + + // 32 X N + p.cu[BLOCK_32x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x32_avx; + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x32_avx; + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x64_avx; + + // 64 X N + p.cu[BLOCK_64x64].copy_sp = (copy_sp_t)x265_blockcopy_ss_64x64_avx; + p.frameInitLowres = x265_frame_init_lowres_core_avx; + + p.pu[LUMA_64x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x16_avx; + p.pu[LUMA_64x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x32_avx; + p.pu[LUMA_64x48].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x48_avx; + p.pu[LUMA_64x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x64_avx; } if (cpuMask & X265_CPU_XOP) { - p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_xop; + //p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_xop; this one is broken ALL_LUMA_PU(satd, pixel_satd, xop); ASSIGN_SA8D(xop); LUMA_VAR(xop); @@ -975,36 +1079,48 @@ }
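The hunks above fill a table of function pointers: each `if (cpuMask & X265_CPU_...)` block overwrites entries set by an earlier block, so the most capable instruction set that the CPU reports ends up installed. A minimal sketch of that pattern follows. Every name in it (`Primitives`, `CPU_SSE2`, `CPU_AVX`, `setupPrimitives`, the `copy_*` kernels) is hypothetical and chosen for illustration; none of it is the x265 API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical capability flags (not the real X265_CPU_* values).
enum : uint32_t { CPU_SSE2 = 1u << 0, CPU_AVX = 1u << 1 };

using copy_fn = int (*)(int16_t* dst, const int16_t* src, int n);

// Stand-in kernels: each copies n samples and returns a tag that
// identifies which variant ran.
static int copy_c(int16_t* d, const int16_t* s, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = s[i];
    return 0;
}
static int copy_sse2(int16_t* d, const int16_t* s, int n) { copy_c(d, s, n); return 1; }
static int copy_avx(int16_t* d, const int16_t* s, int n)  { copy_c(d, s, n); return 2; }

struct Primitives
{
    copy_fn copy_ss = copy_c; // portable fallback
};

// Later blocks overwrite earlier assignments, so the newest
// instruction set present in cpuMask wins.
inline void setupPrimitives(Primitives& p, uint32_t cpuMask)
{
    if (cpuMask & CPU_SSE2)
        p.copy_ss = copy_sse2;
    if (cpuMask & CPU_AVX)
        p.copy_ss = copy_avx;
}
```

Ordering the blocks from oldest to newest instruction set keeps each block independent: disabling a broken kernel, as the commented-out `satd` assignments do, simply leaves the previous assignment in place.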
x265_1.5.tar.gz/source/common/x86/blockcopy8.asm -> x265_1.6.tar.gz/source/common/x86/blockcopy8.asm
Changed
@@ -47,15 +47,15 @@
 cglobal blockcopy_pp_2x4, 4, 7, 0
     mov r4w, [r2]
     mov r5w, [r2 + r3]
-    lea r2, [r2 + r3 * 2]
-    mov r6w, [r2]
+    mov r6w, [r2 + 2 * r3]
+    lea r3, [r3 + 2 * r3]
     mov r3w, [r2 + r3]
-    mov [r0], r4w
-    mov [r0 + r1], r5w
-    lea r0, [r0 + 2 * r1]
-    mov [r0], r6w
-    mov [r0 + r1], r3w
+    mov [r0], r4w
+    mov [r0 + r1], r5w
+    mov [r0 + 2 * r1], r6w
+    lea r1, [r1 + 2 * r1]
+    mov [r0 + r1], r3w
     RET
 ;-----------------------------------------------------------------------------
@@ -63,37 +63,29 @@
 ;-----------------------------------------------------------------------------
 INIT_XMM sse2
 cglobal blockcopy_pp_2x8, 4, 7, 0
-    mov r4w, [r2]
-    mov r5w, [r2 + r3]
-    mov r6w, [r2 + 2 * r3]
+    lea r5, [3 * r1]
+    lea r6, [3 * r3]
-    mov [r0], r4w
-    mov [r0 + r1], r5w
-    mov [r0 + 2 * r1], r6w
-
-    lea r0, [r0 + 2 * r1]
-    lea r2, [r2 + 2 * r3]
-
-    mov r4w, [r2 + r3]
-    mov r5w, [r2 + 2 * r3]
-
-    mov [r0 + r1], r4w
-    mov [r0 + 2 * r1], r5w
-
-    lea r0, [r0 + 2 * r1]
-    lea r2, [r2 + 2 * r3]
-
-    mov r4w, [r2 + r3]
-    mov r5w, [r2 + 2 * r3]
-
-    mov [r0 + r1], r4w
-    mov [r0 + 2 * r1], r5w
-
-    lea r0, [r0 + 2 * r1]
-    lea r2, [r2 + 2 * r3]
-
-    mov r4w, [r2 + r3]
-    mov [r0 + r1], r4w
+    mov r4w, [r2]
+    mov [r0], r4w
+    mov r4w, [r2 + r3]
+    mov [r0 + r1], r4w
+    mov r4w, [r2 + 2 * r3]
+    mov [r0 + 2 * r1], r4w
+    mov r4w, [r2 + r6]
+    mov [r0 + r5], r4w
+
+    lea r2, [r2 + 4 * r3]
+    mov r4w, [r2]
+    lea r0, [r0 + 4 * r1]
+    mov [r0], r4w
+
+    mov r4w, [r2 + r3]
+    mov [r0 + r1], r4w
+    mov r4w, [r2 + 2 * r3]
+    mov [r0 + 2 * r1], r4w
+    mov r4w, [r2 + r6]
+    mov [r0 + r5], r4w
     RET
 ;-----------------------------------------------------------------------------
@@ -101,16 +93,30 @@
 ;-----------------------------------------------------------------------------
 INIT_XMM sse2
 cglobal blockcopy_pp_2x16, 4, 7, 0
-    mov r6d, 16/2
-.loop:
-    mov r4w, [r2]
-    mov r5w, [r2 + r3]
-    dec r6d
-    lea r2, [r2 + r3 * 2]
-    mov [r0], r4w
-    mov [r0 + r1], r5w
-    lea r0, [r0 + r1 * 2]
-    jnz .loop
+    lea r5, [3 * r1]
+    lea r6, [3 * r3]
+
+    mov r4w, [r2]
+    mov [r0], r4w
+    mov r4w, [r2 + r3]
+    mov [r0 + r1], r4w
+    mov r4w, [r2 + 2 * r3]
+    mov [r0 + 2 * r1], r4w
+    mov r4w, [r2 + r6]
+    mov [r0 + r5], r4w
+
+%rep 3
+    lea r2, [r2 + 4 * r3]
+    mov r4w, [r2]
+    lea r0, [r0 + 4 * r1]
+    mov [r0], r4w
+    mov r4w, [r2 + r3]
+    mov [r0 + r1], r4w
+    mov r4w, [r2 + 2 * r3]
+    mov [r0 + 2 * r1], r4w
+    mov r4w, [r2 + r6]
+    mov [r0 + r5], r4w
+%endrep
     RET
@@ -145,115 +151,130 @@
     RET
 ;-----------------------------------------------------------------------------
+; void blockcopy_pp_4x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal blockcopy_pp_4x8, 4, 6, 4
+
+    lea r4, [3 * r1]
+    lea r5, [3 * r3]
+
+    movd m0, [r2]
+    movd m1, [r2 + r3]
+    movd m2, [r2 + 2 * r3]
+    movd m3, [r2 + r5]
+
+    movd [r0], m0
+    movd [r0 + r1], m1
+    movd [r0 + 2 * r1], m2
+    movd [r0 + r4], m3
+
+    lea r2, [r2 + 4 * r3]
+    movd m0, [r2]
+    movd m1, [r2 + r3]
+    movd m2, [r2 + 2 * r3]
+    movd m3, [r2 + r5]
+
+    lea r0, [r0 + 4 * r1]
+    movd [r0], m0
+    movd [r0 + r1], m1
+    movd [r0 + 2 * r1], m2
+    movd [r0 + r4], m3
+    RET
+
+;-----------------------------------------------------------------------------
 ; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
 ;-----------------------------------------------------------------------------
 %macro BLOCKCOPY_PP_W4_H8 2
 INIT_XMM sse2
-cglobal blockcopy_pp_%1x%2, 4, 5, 4
+cglobal blockcopy_pp_%1x%2, 4, 7, 4
     mov r4d, %2/8
+    lea r5, [3 * r1]
+    lea r6, [3 * r3]
+
 .loop:
     movd m0, [r2]
     movd m1, [r2 + r3]
-    lea r2, [r2 + 2 * r3]
-    movd m2, [r2]
-    movd m3, [r2 + r3]
+    movd m2, [r2 + 2 * r3]
+    movd m3, [r2 + r6]
-    movd [r0], m0
-    movd [r0 + r1], m1
-    lea r0, [r0 + 2 * r1]
-    movd [r0], m2
-    movd [r0 + r1], m3
+    movd [r0], m0
+    movd [r0 + r1], m1
+    movd [r0 + 2 * r1], m2
+    movd [r0 + r5], m3
-    lea r0, [r0 + 2 * r1]
-    lea r2, [r2 + 2 * r3]
+    lea r2, [r2 + 4 * r3]
     movd m0, [r2]
     movd m1, [r2 + r3]
-    lea r2, [r2 + 2 * r3]
-    movd m2, [r2]
-    movd m3, [r2 + r3]
+    movd m2, [r2 + 2 * r3]
+    movd m3, [r2 + r6]
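The rewritten 2xN and 4xN copies share one idea: precompute `3 * stride` for both buffers (the paired `lea` instructions at the top of each function), address rows 0 to 3 directly from the base pointer, then advance both pointers by `4 * stride`. The C++ sketch below mirrors that access pattern. The function name `blockcopy_pp_by4` is hypothetical, and `pixel` is assumed to be `uint8_t` as in an 8-bit build.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

using pixel = uint8_t; // assumption: 8-bit build

// Copies a W x H block four rows per step. Rows 0..3 are addressed
// from the base pointer using stride, 2*stride and a precomputed
// 3*stride, so only one pointer update per buffer is needed per step.
template<int W, int H>
void blockcopy_pp_by4(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
{
    static_assert(H % 4 == 0, "sketch handles heights that are multiples of 4");
    const intptr_t dst3 = 3 * dstStride; // like lea rN, [3 * r1]
    const intptr_t src3 = 3 * srcStride; // like lea rN, [3 * r3]

    for (int y = 0; y < H; y += 4)
    {
        std::memcpy(dst,                 src,                 W * sizeof(pixel));
        std::memcpy(dst + dstStride,     src + srcStride,     W * sizeof(pixel));
        std::memcpy(dst + 2 * dstStride, src + 2 * srcStride, W * sizeof(pixel));
        std::memcpy(dst + dst3,          src + src3,          W * sizeof(pixel));
        dst += 4 * dstStride;
        src += 4 * srcStride;
    }
}
```

Compared with the 1.5 code, which advanced the pointers every two rows, this halves the number of address updates at the cost of two extra registers, which is why the `cglobal` register counts grow in the diff.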
x265_1.5.tar.gz/source/common/x86/blockcopy8.h -> x265_1.6.tar.gz/source/common/x86/blockcopy8.h
Changed
@@ -48,6 +48,12 @@
 void x265_cpy1Dto2D_shr_8_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift);
 void x265_cpy1Dto2D_shr_16_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift);
 void x265_cpy1Dto2D_shr_32_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift);
+void x265_cpy2Dto1D_shl_8_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shl_16_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shl_32_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shr_8_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shr_16_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shr_32_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
 uint32_t x265_copy_cnt_4_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride);
 uint32_t x265_copy_cnt_8_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride);
 uint32_t x265_copy_cnt_16_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride);
@@ -198,6 +204,15 @@
 void x265_blockcopy_ss_64x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
 void x265_blockcopy_ss_64x48_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
 void x265_blockcopy_ss_64x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x8_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x16_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x24_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x48_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_48x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_24x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_24x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
 void x265_blockcopy_pp_32x8_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 void x265_blockcopy_pp_32x16_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
@@ -205,9 +220,36 @@
 void x265_blockcopy_pp_32x32_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 void x265_blockcopy_pp_32x48_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 void x265_blockcopy_pp_32x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_64x16_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_64x32_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_64x48_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_64x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_48x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 void x265_blockfill_s_16x16_avx2(int16_t* dst, intptr_t dstride, int16_t val);
 void x265_blockfill_s_32x32_avx2(int16_t* dst, intptr_t dstride, int16_t val);
+// copy_sp primitives
+// 16 x N
+void x265_blockcopy_sp_16x16_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+void x265_blockcopy_sp_16x32_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+
+// 32 x N
+void x265_blockcopy_sp_32x32_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+void x265_blockcopy_sp_32x64_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+
+// 64 x N
+void x265_blockcopy_sp_64x64_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+// copy_ps primitives
+// 16 x N
+void x265_blockcopy_ps_16x16_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_ps_16x32_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+
+// 32 x N
+void x265_blockcopy_ps_32x32_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_ps_32x64_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+
+// 64 x N
+void x265_blockcopy_ps_64x64_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 #undef BLOCKCOPY_COMMON
 #undef BLOCKCOPY_SS_PP
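Judging only by the names and signatures declared above, the new `cpy2Dto1D_shl_N` primitives pack an N x N block of strided 16-bit coefficients into a contiguous buffer while scaling each value up by `shift` bits. A scalar sketch of that reading follows. The name `cpy2Dto1D_shl_ref` is hypothetical, and the right-shift (`shr`) variants are left out because their rounding behaviour is not visible in this diff.

```cpp
#include <cassert>
#include <cstdint>

// Packs an N x N block read with stride srcStride into a contiguous
// N*N buffer, scaling every coefficient by 2^shift.
template<int N>
void cpy2Dto1D_shl_ref(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
{
    for (int y = 0; y < N; y++)
    {
        for (int x = 0; x < N; x++)
        {
            // Multiply instead of using << so that negative
            // coefficients stay well-defined before C++20.
            dst[x] = static_cast<int16_t>(src[x] * (1 << shift));
        }
        src += srcStride; // next source row
        dst += N;         // destination rows are contiguous
    }
}
```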
x265_1.5.tar.gz/source/common/x86/const-a.asm -> x265_1.6.tar.gz/source/common/x86/const-a.asm
Changed
@@ -6,7 +6,7 @@
 ;* Authors: Loren Merritt <lorenm@u.washington.edu>
 ;* Fiona Glaser <fiona@x264.com>
 ;* Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
-;*
+;* Praveen Kumar Tiwari <praveen@multicorewareinc.com>
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
 ;* the Free Software Foundation; either version 2 of the License, or
@@ -37,11 +37,14 @@
 const pw_32, times 16 dw 32
 const pw_128, times 16 dw 128
 const pw_256, times 16 dw 256
+const pw_257, times 16 dw 257
 const pw_512, times 16 dw 512
 const pw_1023, times 8 dw 1023
+ALIGN 32
 const pw_1024, times 16 dw 1024
 const pw_4096, times 16 dw 4096
 const pw_00ff, times 16 dw 0x00ff
+ALIGN 32
 const pw_pixel_max,times 16 dw ((1 << BIT_DEPTH)-1)
 const deinterleave_shufd, dd 0,4,1,5,2,6,3,7
 const pb_unpackbd1, times 2 db 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
@@ -50,16 +53,16 @@
 const pb_unpackwq2, db 4,5,4,5,4,5,4,5,6,7,6,7,6,7,6,7
 const pw_swap, times 2 db 6,7,4,5,2,3,0,1
-const pb_2, times 16 db 2
-const pb_4, times 16 db 4
-const pb_16, times 16 db 16
-const pb_64, times 16 db 64
+const pb_2, times 32 db 2
+const pb_4, times 32 db 4
+const pb_16, times 32 db 16
+const pb_64, times 32 db 64
 const pb_01, times 8 db 0,1
 const pb_0, times 16 db 0
 const pb_a1, times 16 db 0xa1
 const pb_3, times 16 db 3
-const pb_8, times 16 db 8
-const pb_32, times 16 db 32
+const pb_8, times 32 db 8
+const pb_32, times 32 db 32
 const pb_128, times 16 db 128
 const pb_shuf8x8c, db 0,0,0,0,2,2,2,2,4,4,4,4,6,6,6,6
@@ -72,7 +75,7 @@
 const pw_256, times 8 dw 256
 const pw_32_0, times 4 dw 32, times 4 dw 0
-const pw_2000, times 8 dw 0x2000
+const pw_2000, times 16 dw 0x2000
 const pw_8000, times 8 dw 0x8000
 const pw_3fff, times 8 dw 0x3fff
 const pw_ppppmmmm, dw 1,1,1,1,-1,-1,-1,-1
@@ -80,7 +83,7 @@
 const pw_pmpmpmpm, dw 1,-1,1,-1,1,-1,1,-1
 const pw_pmmpzzzz, dw 1,-1,-1,1,0,0,0,0
 const pd_1, times 8 dd 1
-const pd_2, times 4 dd 2
+const pd_2, times 8 dd 2
 const pd_4, times 4 dd 4
 const pd_8, times 4 dd 8
 const pd_16, times 4 dd 16
x265_1.5.tar.gz/source/common/x86/dct8.asm -> x265_1.6.tar.gz/source/common/x86/dct8.asm
Changed
@@ -748,6 +748,368 @@ movhps [r1 + r2], m1 RET +;------------------------------------------------------- +; void dct8(const int16_t* src, int16_t* dst, intptr_t srcStride) +;------------------------------------------------------- +INIT_XMM sse2 +cglobal dct8, 3,6,8,0-16*mmsize + ;------------------------ + ; Stack Mapping(dword) + ;------------------------ + ; Row0[0-3] Row1[0-3] + ; ... + ; Row6[0-3] Row7[0-3] + ; Row0[0-3] Row7[0-3] + ; ... + ; Row6[4-7] Row7[4-7] + ;------------------------ +%if BIT_DEPTH == 10 + %define DCT_SHIFT1 4 + %define DCT_ADD1 [pd_8] +%elif BIT_DEPTH == 8 + %define DCT_SHIFT1 2 + %define DCT_ADD1 [pd_2] +%else + %error Unsupported BIT_DEPTH! +%endif +%define DCT_ADD2 [pd_256] +%define DCT_SHIFT2 9 + + add r2, r2 + lea r3, [r2 * 3] + mov r5, rsp +%assign x 0 +%rep 2 + movu m0, [r0] + movu m1, [r0 + r2] + movu m2, [r0 + r2 * 2] + movu m3, [r0 + r3] + + punpcklwd m4, m0, m1 + punpckhwd m0, m1 + punpcklwd m5, m2, m3 + punpckhwd m2, m3 + punpckldq m1, m4, m5 ; m1 = [1 0] + punpckhdq m4, m5 ; m4 = [3 2] + punpckldq m3, m0, m2 + punpckhdq m0, m2 + pshufd m2, m3, 0x4E ; m2 = [4 5] + pshufd m0, m0, 0x4E ; m0 = [6 7] + + paddw m3, m1, m0 + psubw m1, m0 ; m1 = [d1 d0] + paddw m0, m4, m2 + psubw m4, m2 ; m4 = [d3 d2] + punpcklqdq m2, m3, m0 ; m2 = [s2 s0] + punpckhqdq m3, m0 + pshufd m3, m3, 0x4E ; m3 = [s1 s3] + + punpcklwd m0, m1, m4 ; m0 = [d2/d0] + punpckhwd m1, m4 ; m1 = [d3/d1] + punpckldq m4, m0, m1 ; m4 = [d3 d1 d2 d0] + punpckhdq m0, m1 ; m0 = [d3 d1 d2 d0] + + ; odd + lea r4, [tab_dct8_1] + pmaddwd m1, m4, [r4 + 0*16] + pmaddwd m5, m0, [r4 + 0*16] + pshufd m1, m1, 0xD8 + pshufd m5, m5, 0xD8 + mova m7, m1 + punpckhqdq m7, m5 + punpcklqdq m1, m5 + paddd m1, m7 + paddd m1, DCT_ADD1 + psrad m1, DCT_SHIFT1 + %if x == 1 + pshufd m1, m1, 0x1B + %endif + mova [r5 + 1*2*mmsize], m1 ; Row 1 + + pmaddwd m1, m4, [r4 + 1*16] + pmaddwd m5, m0, [r4 + 1*16] + pshufd m1, m1, 0xD8 + pshufd m5, m5, 0xD8 + mova m7, m1 + punpckhqdq m7, m5 + punpcklqdq m1, m5 
+ paddd m1, m7 + paddd m1, DCT_ADD1 + psrad m1, DCT_SHIFT1 + %if x == 1 + pshufd m1, m1, 0x1B + %endif + mova [r5 + 3*2*mmsize], m1 ; Row 3 + + pmaddwd m1, m4, [r4 + 2*16] + pmaddwd m5, m0, [r4 + 2*16] + pshufd m1, m1, 0xD8 + pshufd m5, m5, 0xD8 + mova m7, m1 + punpckhqdq m7, m5 + punpcklqdq m1, m5 + paddd m1, m7 + paddd m1, DCT_ADD1 + psrad m1, DCT_SHIFT1 + %if x == 1 + pshufd m1, m1, 0x1B + %endif + mova [r5 + 5*2*mmsize], m1 ; Row 5 + + pmaddwd m4, [r4 + 3*16] + pmaddwd m0, [r4 + 3*16] + pshufd m4, m4, 0xD8 + pshufd m0, m0, 0xD8 + mova m7, m4 + punpckhqdq m7, m0 + punpcklqdq m4, m0 + paddd m4, m7 + paddd m4, DCT_ADD1 + psrad m4, DCT_SHIFT1 + %if x == 1 + pshufd m4, m4, 0x1B + %endif + mova [r5 + 7*2*mmsize], m4; Row 7 + + ; even + lea r4, [tab_dct4] + paddw m0, m2, m3 ; m0 = [EE1 EE0] + pshufd m0, m0, 0xD8 + pshuflw m0, m0, 0xD8 + pshufhw m0, m0, 0xD8 + psubw m2, m3 ; m2 = [EO1 EO0] + pmullw m2, [pw_ppppmmmm] + pshufd m2, m2, 0xD8 + pshuflw m2, m2, 0xD8 + pshufhw m2, m2, 0xD8 + pmaddwd m3, m0, [r4 + 0*16] + paddd m3, DCT_ADD1 + psrad m3, DCT_SHIFT1 + %if x == 1 + pshufd m3, m3, 0x1B + %endif + mova [r5 + 0*2*mmsize], m3 ; Row 0 + pmaddwd m0, [r4 + 2*16] + paddd m0, DCT_ADD1 + psrad m0, DCT_SHIFT1 + %if x == 1 + pshufd m0, m0, 0x1B + %endif + mova [r5 + 4*2*mmsize], m0 ; Row 4 + pmaddwd m3, m2, [r4 + 1*16] + paddd m3, DCT_ADD1 + psrad m3, DCT_SHIFT1 + %if x == 1 + pshufd m3, m3, 0x1B + %endif + mova [r5 + 2*2*mmsize], m3 ; Row 2 + pmaddwd m2, [r4 + 3*16] + paddd m2, DCT_ADD1 + psrad m2, DCT_SHIFT1 + %if x == 1 + pshufd m2, m2, 0x1B + %endif + mova [r5 + 6*2*mmsize], m2 ; Row 6 + + %if x != 1 + lea r0, [r0 + r2 * 4] + add r5, mmsize + %endif +%assign x x+1 +%endrep + + mov r0, rsp ; r0 = pointer to Low Part + lea r4, [tab_dct8_2] + +%assign x 0 +%rep 4 + mova m0, [r0 + 0*2*mmsize] ; [3 2 1 0] + mova m1, [r0 + 1*2*mmsize] + paddd m2, m0, [r0 + (0*2+1)*mmsize] + pshufd m2, m2, 0x9C ; m2 = [s2 s1 s3 s0] + paddd m3, m1, [r0 + (1*2+1)*mmsize] + pshufd m3, m3, 0x9C ; m3 
= ^^ + psubd m0, [r0 + (0*2+1)*mmsize] ; m0 = [d3 d2 d1 d0] + psubd m1, [r0 + (1*2+1)*mmsize] ; m1 = ^^ + + ; even + pshufd m4, m2, 0xD8 + pshufd m3, m3, 0xD8 + mova m7, m4 + punpckhqdq m7, m3 + punpcklqdq m4, m3 + mova m2, m4 + paddd m4, m7 ; m4 = [EE1 EE0 EE1 EE0] + psubd m2, m7 ; m2 = [EO1 EO0 EO1 EO0] + + pslld m4, 6 ; m4 = [64*EE1 64*EE0] + mova m5, m2
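The `DCT_ADD*` / `DCT_SHIFT*` pairs defined at the top of the new `dct8` routine implement a rounding right shift: the added constant is half of the divisor (`pd_2` with shift 2, `pd_8` with shift 4, `pd_256` with shift 9). A scalar sketch of that arithmetic follows. The helper names are hypothetical, and the `bitDepth - 6` expression is only a compact way to write the two first-stage cases that appear in the diff for the 8x8 transform (2 for 8-bit, 4 for 10-bit).

```cpp
#include <cassert>
#include <cstdint>

// First-stage shift for the 8x8 transform as given in the diff:
// 2 when BIT_DEPTH == 8, 4 when BIT_DEPTH == 10.
constexpr int dctShift1(int bitDepth) { return bitDepth - 6; }

// Second-stage shift is fixed at 9 (DCT_ADD2 is pd_256 == 1 << 8).
constexpr int kDctShift2 = 9;

// Rounding shift: add half of the divisor, then shift right.
// The assembly uses psrad, an arithmetic shift; for negative inputs
// this relies on >> being arithmetic, which holds on mainstream
// compilers and is guaranteed from C++20 onward.
inline int32_t roundShift(int32_t v, int shift)
{
    return (v + (1 << (shift - 1))) >> shift;
}
```

For example, with the 8-bit first stage `roundShift(6, 2)` evaluates `(6 + 2) >> 2`, which is 2, where a plain shift would truncate 6 / 4 down to 1.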
x265_1.5.tar.gz/source/common/x86/dct8.h -> x265_1.6.tar.gz/source/common/x86/dct8.h
Changed
@@ -24,6 +24,7 @@
 #ifndef X265_DCT8_H
 #define X265_DCT8_H
 void x265_dct4_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void x265_dct8_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dst4_ssse3(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct8_sse4(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
x265_1.5.tar.gz/source/common/x86/intrapred.h -> x265_1.6.tar.gz/source/common/x86/intrapred.h
Changed
@@ -4,7 +4,7 @@
  * Copyright (C) 2003-2013 x264 project
  *
  * Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
- *
+ * Praveen Kumar Tiwari <praveen@multicorewareinc.com>
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -26,11 +26,19 @@
 #ifndef X265_INTRAPRED_H
 #define X265_INTRAPRED_H
-void x265_intra_pred_dc4_sse4 (pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc4_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc8_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc16_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc32_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc4_sse4(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
 void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
+void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar8_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar16_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar32_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar4_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
@@ -39,6 +47,15 @@
 #define DECL_ANG(bsize, mode, cpu) \
 void x265_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+DECL_ANG(4, 2, sse2);
+DECL_ANG(4, 3, sse2);
+DECL_ANG(4, 4, sse2);
+DECL_ANG(4, 5, sse2);
+DECL_ANG(4, 6, sse2);
+DECL_ANG(4, 7, sse2);
+DECL_ANG(4, 8, sse2);
+DECL_ANG(4, 9, sse2);
+
 DECL_ANG(4, 2, ssse3);
 DECL_ANG(4, 3, sse4);
 DECL_ANG(4, 4, sse4);
@@ -157,6 +174,44 @@
 DECL_ANG(32, 33, sse4);
 #undef DECL_ANG
+void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_34_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_2_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_26_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
x265_1.5.tar.gz/source/common/x86/intrapred16.asm -> x265_1.6.tar.gz/source/common/x86/intrapred16.asm
Changed
@@ -65,6 +65,10 @@ pw_planar16_1: dw 15, 15, 15, 15, 15, 15, 15, 15 pd_planar32_1: dd 31, 31, 31, 31 +pw_planar32_1: dw 31, 31, 31, 31, 31, 31, 31, 31 +pw_planar32_L: dw 31, 30, 29, 28, 27, 26, 25, 24 +pw_planar32_H: dw 23, 22, 21, 20, 19, 18, 17, 16 + const planar32_table %assign x 31 %rep 8 @@ -82,15 +86,19 @@ SECTION .text cextern pw_1 +cextern pw_2 cextern pw_4 cextern pw_8 cextern pw_16 +cextern pw_32 cextern pw_1023 cextern pd_16 cextern pd_32 cextern pw_4096 cextern multiL cextern multiH +cextern multiH2 +cextern multiH3 cextern multi_2Row cextern pw_swap cextern pb_unpackwq1 @@ -99,6 +107,592 @@ ;----------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter) ;----------------------------------------------------------------------------------- +INIT_XMM sse2 +cglobal intra_pred_dc4, 5,6,2 + movh m0, [r2 + 18] ; sumAbove + movh m1, [r2 + 2] ; sumLeft + + paddw m0, m1 + pshuflw m1, m0, 0x4E + paddw m0, m1 + pshuflw m1, m0, 0xB1 + paddw m0, m1 + + test r4d, r4d + + paddw m0, [pw_4] + psraw m0, 3 + + ; store DC 4x4 + movh [r0], m0 + movh [r0 + r1 * 2], m0 + movh [r0 + r1 * 4], m0 + lea r5, [r0 + r1 * 4] + movh [r5 + r1 * 2], m0 + + ; do DC filter + jz .end + movh m1, m0 + psllw m1, 1 + paddw m1, [pw_2] + movd r3d, m1 + paddw m0, m1 + ; filter top + movh m1, [r2 + 2] + paddw m1, m0 + psraw m1, 2 + movh [r0], m1 ; overwrite top-left pixel, we will update it later + + ; filter top-left + movzx r3d, r3w + movzx r4d, word [r2 + 18] + add r3d, r4d + movzx r4d, word [r2 + 2] + add r4d, r3d + shr r4d, 2 + mov [r0], r4w + + ; filter left + movu m1, [r2 + 20] + paddw m1, m0 + psraw m1, 2 + movd r3d, m1 + mov [r0 + r1 * 2], r3w + shr r3d, 16 + mov [r0 + r1 * 4], r3w + pextrw r3d, m1, 2 + mov [r5 + r1 * 2], r3w +.end: + RET + +;----------------------------------------------------------------------------------- +; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* 
above, int, int filter) +;----------------------------------------------------------------------------------- +INIT_XMM sse2 +cglobal intra_pred_dc8, 5, 8, 2 + movu m0, [r2 + 34] + movu m1, [r2 + 2] + + paddw m0, m1 + movhlps m1, m0 + paddw m0, m1 + pshufd m1, m0, 1 + paddw m0, m1 + pmaddwd m0, [pw_1] + + paddw m0, [pw_8] + psraw m0, 4 ; sum = sum / 16 + pshuflw m0, m0, 0 + pshufd m0, m0, 0 ; m0 = word [dc_val ...] + + test r4d, r4d + + ; store DC 8x8 + lea r6, [r1 + r1 * 4] + lea r6, [r6 + r1] + lea r5, [r6 + r1 * 4] + lea r7, [r6 + r1 * 8] + movu [r0], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 4], m0 + movu [r0 + r6], m0 + movu [r0 + r1 * 8], m0 + movu [r0 + r5], m0 + movu [r0 + r6 * 2], m0 + movu [r0 + r7], m0 + + ; Do DC Filter + jz .end + mova m1, [pw_2] + pmullw m1, m0 + paddw m1, [pw_2] + movd r4d, m1 ; r4d = DC * 2 + 2 + paddw m1, m0 ; m1 = DC * 3 + 2 + pshuflw m1, m1, 0 + pshufd m1, m1, 0 ; m1 = pixDCx3 + + ; filter top + movu m0, [r2 + 2] + paddw m0, m1 + psraw m0, 2 + movu [r0], m0 + + ; filter top-left + movzx r4d, r4w + movzx r3d, word [r2 + 34] + add r4d, r3d + movzx r3d, word [r2 + 2] + add r3d, r4d + shr r3d, 2 + mov [r0], r3w + + ; filter left + movu m0, [r2 + 36] + paddw m0, m1 + psraw m0, 2 + movh r3, m0 + mov [r0 + r1 * 2], r3w + shr r3, 16 + mov [r0 + r1 * 4], r3w + shr r3, 16 + mov [r0 + r6], r3w + shr r3, 16 + mov [r0 + r1 * 8], r3w + pshufd m0, m0, 0x6E + movh r3, m0 + mov [r0 + r5], r3w + shr r3, 16 + mov [r0 + r6 * 2], r3w + shr r3, 16 + mov [r0 + r7], r3w +.end: + RET + +;------------------------------------------------------------------------------------------------------- +; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* left, pixel* above, int dirMode, int filter) +;------------------------------------------------------------------------------------------------------- +INIT_XMM sse2 +cglobal intra_pred_dc16, 5, 10, 4 + lea r3, [r2 + 66] + add r1, r1 + movu m0, [r3] + movu m1, [r3 + 16] + movu m2, [r2 + 2] + movu m3, [r2 
+ 18] + + paddw m0, m1 + paddw m2, m3 + paddw m0, m2 + movhlps m1, m0 + paddw m0, m1 + pshuflw m1, m0, 0x6E + paddw m0, m1 + pmaddwd m0, [pw_1] + + paddw m0, [pw_16] + psraw m0, 5 + movd r5d, m0
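The new SSE2 `intra_pred_dc4` and `intra_pred_dc8` routines follow the usual HEVC DC prediction: sum the N above and N left neighbours, round and divide by 2N (`paddw [pw_4]` then `psraw 3` for 4x4, `[pw_8]` then `psraw 4` for 8x8), fill the block, and optionally smooth the first row and column. A scalar sketch of that computation follows. It takes separate `above` and `left` arrays rather than the single `srcPix` buffer the assembly reads, so the buffer layout is not assumed. The name `intraPredDC_ref` is hypothetical, and `pixel` is taken as `uint16_t` because `intrapred16.asm` is the high-bit-depth file.

```cpp
#include <cassert>
#include <cstdint>

using pixel = uint16_t; // assumption: high-bit-depth build

// DC intra prediction for an N x N block with optional edge filter.
void intraPredDC_ref(pixel* dst, intptr_t stride, const pixel* above, const pixel* left,
                     int N, bool filter)
{
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += above[i] + left[i];
    const int dc = (sum + N) / (2 * N); // (sum + 4) >> 3 for N == 4

    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            dst[y * stride + x] = static_cast<pixel>(dc);

    if (!filter)
        return;

    // Corner blends both neighbours with 2*dc; edges blend one
    // neighbour with 3*dc. All use rounding (+2) and >> 2.
    dst[0] = static_cast<pixel>((above[0] + left[0] + 2 * dc + 2) >> 2);
    for (int x = 1; x < N; x++)
        dst[x] = static_cast<pixel>((above[x] + 3 * dc + 2) >> 2);
    for (int y = 1; y < N; y++)
        dst[y * stride] = static_cast<pixel>((left[y] + 3 * dc + 2) >> 2);
}
```

The assembly computes `2 * dc + 2` once and reuses it both for the corner and, after adding `dc` again, for the `3 * dc + 2` edge term, which is the same arithmetic with fewer multiplies.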
x265_1.5.tar.gz/source/common/x86/intrapred8.asm -> x265_1.6.tar.gz/source/common/x86/intrapred8.asm
Changed
@@ -2,6 +2,7 @@ ;* Copyright (C) 2013 x265 project ;* ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com> +;* Praveen Kumar Tiwari <praveen@multicorewareinc.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -26,11 +27,15 @@ SECTION_RODATA 32 +intra_pred_shuff_0_8: times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 + pb_0_8 times 8 db 0, 8 pb_unpackbw1 times 2 db 1, 8, 2, 8, 3, 8, 4, 8 pb_swap8: times 2 db 7, 6, 5, 4, 3, 2, 1, 0 c_trans_4x4 db 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 -tab_Si: db 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7 +const tab_S1, db 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0, 0, 0, 0 +const tab_S2, db 0, 1, 3, 5, 7, 9, 11, 13, 0, 0, 0, 0, 0, 0, 0, 0 +const tab_Si, db 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7 pb_fact0: db 0, 2, 4, 6, 8, 10, 12, 14, 0, 0, 0, 0, 0, 0, 0, 0 c_mode32_12_0: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 7, 0 c_mode32_13_0: db 3, 6, 10, 13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 @@ -43,7 +48,6 @@ c_mode32_18_0: db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 c_shuf8_0: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 c_deinterval8: db 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15 -tab_S1: db 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0, 0, 0, 0 pb_unpackbq: db 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 c_mode16_12: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 6 c_mode16_13: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4 @@ -52,8 +56,327 @@ c_mode16_16: db 8, 6, 5, 3, 2, 0, 15, 14, 12, 11, 9, 8, 6, 5, 3, 2 c_mode16_17: db 4, 2, 1, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1 c_mode16_18: db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 -tab_S2: db 0, 1, 3, 5, 7, 9, 11, 13, 0, 0, 0, 0, 0, 0, 0, 0 +ALIGN 32 +trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7 +c_ang8_src1_9_2_10: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 1, 2, 2, 3, 3, 4, 4, 5, 5, 
6, 6, 7, 7, 8, 8, 9 +c_ang8_26_20: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang8_src3_11_4_12: db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11 +c_ang8_14_8: db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 +c_ang8_src5_13_5_13: db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12 +c_ang8_2_28: db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 +c_ang8_src6_14_7_15: db 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 +c_ang8_22_16: db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + +c_ang8_21_10 : db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 +c_ang8_src2_10_3_11: db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10 +c_ang8_31_20: db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang8_src4_12_4_12: times 2 db 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11 +c_ang8_9_30: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 +c_ang8_src5_13_6_14: db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 +c_ang8_19_8: db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + +c_ang8_17_2: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 +c_ang8_19_4: db 13, 
19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 +c_ang8_21_6: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 +c_ang8_23_8: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, +c_ang8_src4_12_5_13: db 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12 + +c_ang8_13_26: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 +c_ang8_7_20: db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang8_1_14: db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 +c_ang8_27_8: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 +c_ang8_src2_10_2_10: db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9 +c_ang8_src3_11_3_11: db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10 + +c_ang8_31_8: db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 +c_ang8_13_22: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 +c_ang8_27_4: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 +c_ang8_9_18: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + +c_ang8_5_10: db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 +c_ang8_15_20: db 17, 15, 17, 15, 
17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 +c_ang8_25_30: db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 +c_ang8_3_8: db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + +c_ang8_mode_27: db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + +c_ang8_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + +c_ang8_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 + db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + +ALIGN 32 +c_ang16_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 
28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + + +ALIGN 32 +c_ang16_mode_28: db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + + +ALIGN 32 +c_ang16_mode_27: db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 
30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + +ALIGN 32 +intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15 + + +ALIGN 32 +c_ang16_mode_29: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 29, 3, 29, 3, 29, 3, 29, 3, 
29, 3, 29, 3, 29, 3, 29, 3, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + + +ALIGN 32 +c_ang16_mode_30: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 + db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + + +ALIGN 32 +c_ang16_mode_31: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 + db 
24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 + db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + +ALIGN 32 +c_ang16_mode_32: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16 + +ALIGN 32 +c_ang16_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + +ALIGN 32 +c_ang16_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 
15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
x265_1.6.tar.gz/source/common/x86/intrapred8_allangs.asm
Added
@@ -0,0 +1,23008 @@ +;***************************************************************************** +;* Copyright (C) 2013 x265 project +;* +;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com> +;* Praveen Tiwari <praveen@multicorewareinc.com> +;* +;* This program is free software; you can redistribute it and/or modify +;* it under the terms of the GNU General Public License as published by +;* the Free Software Foundation; either version 2 of the License, or +;* (at your option) any later version. +;* +;* This program is distributed in the hope that it will be useful, +;* but WITHOUT ANY WARRANTY; without even the implied warranty of +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;* GNU General Public License for more details. +;* +;* You should have received a copy of the GNU General Public License +;* along with this program; if not, write to the Free Software +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. +;* +;* This program is also available under a commercial proprietary license. +;* For more information, contact us at license @ x265.com. 
+;*****************************************************************************/ + +%include "x86inc.asm" +%include "x86util.asm" + +SECTION_RODATA 32 + +SECTION .text + +; global constant +cextern pw_1024 + +; common constant with intrapred8.asm +cextern ang_table +cextern tab_S1 +cextern tab_S2 +cextern tab_Si + + +;----------------------------------------------------------------------------- +; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma) +;----------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal all_angs_pred_4x4, 4, 4, 8 + +; mode 2 + +movh m0, [r1 + 10] +movd [r0], m0 + +palignr m1, m0, 1 +movd [r0 + 4], m1 + +palignr m1, m0, 2 +movd [r0 + 8], m1 + +palignr m1, m0, 3 +movd [r0 + 12], m1 + +; mode 3 + +mova m2, [pw_1024] + +pslldq m1, m0, 1 +pinsrb m1, [r1 + 9], 0 +punpcklbw m1, m0 + +lea r3, [ang_table] + +pmaddubsw m6, m1, [r3 + 26 * 16] +pmulhrsw m6, m2 +packuswb m6, m6 +movd [r0 + 16], m6 + +palignr m0, m1, 2 + +mova m7, [r3 + 20 * 16] + +pmaddubsw m3, m0, m7 +pmulhrsw m3, m2 +packuswb m3, m3 +movd [r0 + 20], m3 + +; mode 6 [row 3] +movd [r0 + 76], m3 + +palignr m3, m1, 4 + +pmaddubsw m4, m3, [r3 + 14 * 16] +pmulhrsw m4, m2 +packuswb m4, m4 +movd [r0 + 24], m4 + +palignr m4, m1, 6 + +pmaddubsw m4, [r3 + 8 * 16] +pmulhrsw m4, m2 +packuswb m4, m4 +movd [r0 + 28], m4 + +; mode 4 + +pmaddubsw m5, m1, [r3 + 21 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 32], m5 + +pmaddubsw m5, m0, [r3 + 10 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 36], m5 + +pmaddubsw m5, m0, [r3 + 31 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 40], m5 + +pmaddubsw m4, m3, m7 +pmulhrsw m4, m2 +packuswb m4, m4 +movd [r0 + 44], m4 + +; mode 5 + +pmaddubsw m5, m1, [r3 + 17 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 48], m5 + +pmaddubsw m5, m0, [r3 + 2 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 52], m5 + +pmaddubsw m5, m0, [r3 + 19 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd 
[r0 + 56], m5 + +pmaddubsw m4, m3, [r3 + 4 * 16] +pmulhrsw m4, m2 +packuswb m4, m4 +movd [r0 + 60], m4 + +; mode 6 + +pmaddubsw m5, m1, [r3 + 13 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 64], m5 + +movd [r0 + 68], m6 + +pmaddubsw m5, m0, [r3 + 7 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 72], m5 + +; mode 7 + +pmaddubsw m5, m1, [r3 + 9 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 80], m5 + +pmaddubsw m5, m1, [r3 + 18 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 84], m5 + +pmaddubsw m5, m1, [r3 + 27 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 88], m5 + +pmaddubsw m5, m0, [r3 + 4 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 92], m5 + +; mode 8 + +pmaddubsw m5, m1, [r3 + 5 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 96], m5 + +pmaddubsw m5, m1, [r3 + 10 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 100], m5 + +pmaddubsw m5, m1, [r3 + 15 * 16] +pmulhrsw m5, m2 +packuswb m5, m5 +movd [r0 + 104], m5 +
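The repeated pmaddubsw / pmulhrsw / packuswb triple in all_angs_pred_4x4 can be read as a two-tap weighted average with rounding. The following is a minimal scalar sketch in C, not the project's reference code. It assumes that row f of ang_table holds the weight pair (32 - f, f), as constants such as c_ang8_26_20 ("6, 26") suggest; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two-tap angular interpolation between neighbouring reference pixels a and b.
 * frac is the 1/32-pel fraction (0..31); the assumed weight pair is
 * (32 - frac, frac). */
static uint8_t ang_interp_ref(uint8_t a, uint8_t b, int frac)
{
    assert(frac >= 0 && frac < 32);
    return (uint8_t)(((32 - frac) * a + frac * b + 16) >> 5);
}

/* Scalar model of one 16-bit lane of pmulhrsw: ((x * y >> 14) + 1) >> 1. */
static int16_t pmulhrsw_lane(int16_t x, int16_t y)
{
    int32_t t = ((int32_t)x * y) >> 14;
    return (int16_t)((t + 1) >> 1);
}

/* The same value computed the way the SIMD sequence does it: pmaddubsw forms
 * the weighted sum, and pmulhrsw by 1024 (pw_1024) rounds and shifts by 5. */
static uint8_t ang_interp_simd_model(uint8_t a, uint8_t b, int frac)
{
    assert(frac >= 0 && frac < 32);
    int16_t sum = (int16_t)((32 - frac) * a + frac * b); /* at most 32 * 255 */
    return (uint8_t)pmulhrsw_lane(sum, 1024);
}
```

Multiplying by 1024 in pmulhrsw is equivalent to (x + 16) >> 5 for the non-negative sums that occur here, which is why a single constant register (m2 in the listing) serves every mode.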
x265_1.5.tar.gz/source/common/x86/ipfilter16.asm -> x265_1.6.tar.gz/source/common/x86/ipfilter16.asm
Changed
@@ -31,6 +31,7 @@ tab_c_n32768: times 4 dd -32768 tab_c_524800: times 4 dd 524800 tab_c_n8192: times 8 dw -8192 +pd_524800: times 8 dd 524800 tab_Tm16: db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 @@ -91,9 +92,28 @@ times 4 dw -5, 17 times 4 dw 58, -10 times 4 dw 4, -1 +ALIGN 32 +tab_LumaCoeffVer: times 8 dw 0, 0 + times 8 dw 0, 64 + times 8 dw 0, 0 + times 8 dw 0, 0 + + times 8 dw -1, 4 + times 8 dw -10, 58 + times 8 dw 17, -5 + times 8 dw 1, 0 + + times 8 dw -1, 4 + times 8 dw -11, 40 + times 8 dw 40, -11 + times 8 dw 4, -1 + + times 8 dw 0, 1 + times 8 dw -5, 17 + times 8 dw 58, -10 + times 8 dw 4, -1 SECTION .text - cextern pd_32 cextern pw_pixel_max cextern pd_n32768 @@ -2562,6 +2582,2681 @@ FILTER_VER_LUMA_PP 64, 16 FILTER_VER_LUMA_PP 16, 64 +%macro FILTER_VER_LUMA_AVX2_4x4 1 +INIT_YMM avx2 +cglobal interp_8tap_vert_%1_4x4, 4, 6, 7 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_LumaCoeffVer] + add r5, r4 +%else + lea r5, [tab_LumaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r4 + +%ifidn %1,pp + vbroadcasti128 m6, [pd_32] +%elifidn %1, sp + mova m6, [pd_524800] +%else + vbroadcasti128 m6, [pd_n32768] +%endif + + movq xm0, [r0] + movq xm1, [r0 + r1] + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + movq xm3, [r0 + r4] + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + movq xm3, [r0 + r1] + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m5, m4, [r5 + 2 * mmsize] + pmaddwd m4, [r5 + 1 * mmsize] + paddd m0, m5 + paddd m2, m4 + movq xm3, [r0 + r4] + punpcklwd xm1, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] + punpcklwd xm3, xm4 + vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] + pmaddwd m5, m1, [r5 + 3 * 
mmsize] + pmaddwd m1, [r5 + 2 * mmsize] + paddd m0, m5 + paddd m2, m1 + movq xm3, [r0 + r1] + punpcklwd xm4, xm3 + movq xm1, [r0 + 2 * r1] + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [A 9 9 8] + pmaddwd m4, [r5 + 3 * mmsize] + paddd m2, m4 + +%ifidn %1,ss + psrad m0, 6 + psrad m2, 6 +%else + paddd m0, m6 + paddd m2, m6 +%ifidn %1,pp + psrad m0, 6 + psrad m2, 6 +%elifidn %1, sp + psrad m0, 10 + psrad m2, 10 +%else + psrad m0, 2 + psrad m2, 2 +%endif +%endif + + packssdw m0, m2 + pxor m1, m1 +%ifidn %1,pp + CLIPW m0, m1, [pw_pixel_max] +%elifidn %1, sp + CLIPW m0, m1, [pw_pixel_max] +%endif + + vextracti128 xm2, m0, 1 + lea r4, [r3 * 3] + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r4], xm2 + RET +%endmacro + +FILTER_VER_LUMA_AVX2_4x4 pp +FILTER_VER_LUMA_AVX2_4x4 ps +FILTER_VER_LUMA_AVX2_4x4 sp +FILTER_VER_LUMA_AVX2_4x4 ss + +%macro FILTER_VER_LUMA_AVX2_8x8 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_vert_%1_8x8, 4, 6, 12 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_LumaCoeffVer] + add r5, r4 +%else + lea r5, [tab_LumaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r4 + +%ifidn %1,pp + vbroadcasti128 m11, [pd_32] +%elifidn %1, sp + mova m11, [pd_524800] +%else + vbroadcasti128 m11, [pd_n32768] +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4
View file
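For orientation, the pixel-to-pixel ("pp") variant of the vertical 8-tap luma filter above can be sketched in scalar C. This is an illustrative sketch under stated assumptions, not the project's reference implementation: the function name and the width/height/pixel_max parameters are hypothetical, and strides are counted in elements rather than bytes. The coefficients are those of tab_LumaCoeffVer, and the rounding (add 32, shift right by 6, clip to the pixel maximum) follows the pp branch of the macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8-tap luma coefficients, one row per quarter-pel position. */
static const int16_t luma_coeff[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

static uint16_t clip_pixel(int32_t v, int32_t pixel_max)
{
    if (v < 0) return 0;
    if (v > pixel_max) return (uint16_t)pixel_max;
    return (uint16_t)v;
}

/* Vertical filter: the taps cover rows -3..+4 around each output row, which
 * is why the assembly first subtracts three strides from the source pointer.
 * The right shift of a possibly negative sum is assumed to be arithmetic,
 * as psrad is. */
static void interp_8tap_vert_pp_ref(const uint16_t *src, ptrdiff_t src_stride,
                                    uint16_t *dst, ptrdiff_t dst_stride,
                                    int width, int height,
                                    int coeff_idx, int pixel_max)
{
    assert(coeff_idx >= 0 && coeff_idx < 4);
    const int16_t *c = luma_coeff[coeff_idx];

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int32_t sum = 0;
            for (int t = 0; t < 8; t++) {
                sum += c[t] * src[(y + t - 3) * src_stride + x];
            }
            dst[y * dst_stride + x] = clip_pixel((sum + 32) >> 6, pixel_max);
        }
    }
}
```

The ps, sp and ss variants in the macro differ only in the offset added and the shift applied before packing.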
x265_1.5.tar.gz/source/common/x86/ipfilter8.asm -> x265_1.6.tar.gz/source/common/x86/ipfilter8.asm
Changed
@@ -35,10 +35,20 @@ const interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15 ALIGN 32 +const interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9 + times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13 + +ALIGN 32 const interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4 dd 2, 3, 3, 4, 4, 5, 5, 6 ALIGN 32 +const pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 + times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10 + times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12 + times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14 + +ALIGN 32 tab_Lm: db 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8 db 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10 db 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12 @@ -51,6 +61,8 @@ tab_c_526336: times 4 dd 8192*64+2048 +pd_526336: times 8 dd 8192*64+2048 + tab_ChromaCoeff: db 0, 64, 0, 0 db -2, 58, 10, -2 db -4, 54, 16, -2 @@ -59,6 +71,30 @@ db -4, 28, 46, -6 db -2, 16, 54, -4 db -2, 10, 58, -2 +ALIGN 32 +tab_ChromaCoeff_V: times 8 db 0, 64 + times 8 db 0, 0 + + times 8 db -2, 58 + times 8 db 10, -2 + + times 8 db -4, 54 + times 8 db 16, -2 + + times 8 db -6, 46 + times 8 db 28, -4 + + times 8 db -4, 36 + times 8 db 36, -4 + + times 8 db -4, 28 + times 8 db 46, -6 + + times 8 db -2, 16 + times 8 db 54, -4 + + times 8 db -2, 10 + times 8 db 58, -2 tab_ChromaCoeffV: times 4 dw 0, 64 times 4 dw 0, 0 @@ -84,6 +120,31 @@ times 4 dw -2, 10 times 4 dw 58, -2 +ALIGN 32 +pw_ChromaCoeffV: times 8 dw 0, 64 + times 8 dw 0, 0 + + times 8 dw -2, 58 + times 8 dw 10, -2 + + times 8 dw -4, 54 + times 8 dw 16, -2 + + times 8 dw -6, 46 + times 8 dw 28, -4 + + times 8 dw -4, 36 + times 8 dw 36, -4 + + times 8 dw -4, 28 + times 8 dw 46, -6 + + times 8 dw -2, 16 + times 8 dw 54, -4 + + times 8 dw -2, 10 + times 8 dw 58, -2 + tab_LumaCoeff: db 0, 0, 0, 64, 0, 0, 0, 0 db -1, 4, -10, 58, 17, -5, 1, 0 db -1, 4, -11, 40, 40, -11, 4, -1 @@ -109,6 +170,47 @@ times 4 dw 
58, -10 times 4 dw 4, -1 +ALIGN 32 +pw_LumaCoeffVer: times 8 dw 0, 0 + times 8 dw 0, 64 + times 8 dw 0, 0 + times 8 dw 0, 0 + + times 8 dw -1, 4 + times 8 dw -10, 58 + times 8 dw 17, -5 + times 8 dw 1, 0 + + times 8 dw -1, 4 + times 8 dw -11, 40 + times 8 dw 40, -11 + times 8 dw 4, -1 + + times 8 dw 0, 1 + times 8 dw -5, 17 + times 8 dw 58, -10 + times 8 dw 4, -1 + +pb_LumaCoeffVer: times 16 db 0, 0 + times 16 db 0, 64 + times 16 db 0, 0 + times 16 db 0, 0 + + times 16 db -1, 4 + times 16 db -10, 58 + times 16 db 17, -5 + times 16 db 1, 0 + + times 16 db -1, 4 + times 16 db -11, 40 + times 16 db 40, -11 + times 16 db 4, -1 + + times 16 db 0, 1 + times 16 db -5, 17 + times 16 db 58, -10 + times 16 db 4, -1 + tab_LumaCoeffVer: times 8 db 0, 0 times 8 db 0, 64 times 8 db 0, 0 @@ -183,6 +285,15 @@ interp4_horiz_shuf1: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 +ALIGN 32 +interp4_hpp_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 + +ALIGN 32 +interp8_hps_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7 + +ALIGN 32 +interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 + SECTION .text cextern pb_128 @@ -913,6 +1024,105 @@ pextrd [r2+r0], xm3, 3 RET +%macro FILTER_HORIZ_LUMA_AVX2_4xN 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_pp_4x%1, 4, 6, 9 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4 * 8] +%endif + + mova m1, [tab_Lm] + mova m2, [pw_1] + mova m7, [interp8_hps_shuf] + mova m8, [pw_512] + + ; register map + ; m0 - interpolate coeff + ; m1 - shuffle order table + ; m2 - constant word 1 + lea r4, [r1 * 3] + lea r5, [r3 * 3] + sub r0, 3 +%rep %1 / 8 + ; Row 0-1 + vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m3, m1 + pmaddubsw m3, m0 + pmaddwd m3, m2 + vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] + pshufb m4, m1 + 
pmaddubsw m4, m0 + pmaddwd m4, m2 + phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A] + + ; Row 2-3 + vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
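The new pw_LumaCoeffVer and pb_LumaCoeffVer tables appear to store each 8-tap row as four adjacent (even, odd) pairs, each pair repeated across a register, so that a multiply-add instruction on two interleaved source rows yields c0*r0 + c1*r1 per lane. A small C sketch of that layout follows; both helper names are hypothetical and the code only illustrates the pairing, not the SIMD broadcast.

```c
#include <assert.h>
#include <stdint.h>

/* Split one 8-tap coefficient row into four (even, odd) pairs, the grouping
 * used by rows such as "times 8 dw -10, 58" in the tables above. */
static void make_coeff_pairs(const int8_t taps[8], int16_t pairs[4][2])
{
    for (int i = 0; i < 4; i++) {
        pairs[i][0] = taps[2 * i];
        pairs[i][1] = taps[2 * i + 1];
    }
}

/* Dot product evaluated pair by pair, the way pmaddwd consumes two
 * interleaved rows at a time. */
static int32_t dot_by_pairs(int16_t pairs[4][2], const int16_t rows[8])
{
    int32_t acc = 0;
    for (int i = 0; i < 4; i++) {
        acc += pairs[i][0] * rows[2 * i] + pairs[i][1] * rows[2 * i + 1];
    }
    return acc;
}
```

Because every row of taps sums to 64, a flat input of value 1 gives a dot product of 64, which is a quick sanity check for any hand-built table.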
x265_1.5.tar.gz/source/common/x86/ipfilter8.h -> x265_1.6.tar.gz/source/common/x86/ipfilter8.h
Changed
@@ -576,8 +576,12 @@
 CHROMA_420_FILTERS(_avx2);
 CHROMA_420_SP_FILTERS(_sse2);
 CHROMA_420_SP_FILTERS_SSE4(_sse4);
+CHROMA_420_SP_FILTERS(_avx2);
+CHROMA_420_SP_FILTERS_SSE4(_avx2);
 CHROMA_420_SS_FILTERS(_sse2);
 CHROMA_420_SS_FILTERS_SSE4(_sse4);
+CHROMA_420_SS_FILTERS(_avx2);
+CHROMA_420_SS_FILTERS_SSE4(_avx2);
 CHROMA_422_FILTERS(_sse4);
 CHROMA_422_FILTERS(_avx2);
@@ -617,10 +621,31 @@
 LUMA_SP_FILTERS(_sse4);
 LUMA_SS_FILTERS(_sse2);
 LUMA_FILTERS(_avx2);
-
+LUMA_SP_FILTERS(_avx2);
+LUMA_SS_FILTERS(_avx2);
 void x265_interp_8tap_hv_pp_8x8_sse4(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-void x265_luma_p2s_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-
+void x265_pixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
 #undef LUMA_FILTERS
 #undef LUMA_SP_FILTERS
 #undef LUMA_SS_FILTERS
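The header change replaces the single x265_luma_p2s_ssse3 entry point, which took width and height at run time, with one pixelToShort function per block size. What such a conversion computes can be sketched in scalar C as below. This is an assumption-laden illustration for 8-bit pixels only: the 14-bit internal precision and the 8192 offset are the usual HEVC intermediate conventions rather than values shown in this diff, and the function name and the explicit dst_stride, width and height parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    INTERNAL_PREC = 14,           /* assumed intermediate precision in bits */
    INTERNAL_OFFS = 1 << 13,      /* assumed offset that recentres the range */
    PIXEL_SHIFT = INTERNAL_PREC - 8
};

/* Convert a block of 8-bit pixels to the signed 16-bit intermediate format. */
static void pixel_to_short_ref(const uint8_t *src, ptrdiff_t src_stride,
                               int16_t *dst, ptrdiff_t dst_stride,
                               int width, int height)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = (src[y * src_stride + x] << PIXEL_SHIFT) - INTERNAL_OFFS;
            dst[y * dst_stride + x] = (int16_t)v;
        }
    }
}
```

Fixing the block size at compile time, as the new per-size prototypes do, lets each variant unroll its loops fully instead of branching on width and height.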
x265_1.5.tar.gz/source/common/x86/mc-a.asm -> x265_1.6.tar.gz/source/common/x86/mc-a.asm
Changed
@@ -1759,7 +1759,570 @@ ADDAVG_W16_H4 24 ;----------------------------------------------------------------------------- +; addAvg avx2 code start +;----------------------------------------------------------------------------- + +INIT_YMM avx2 +cglobal addAvg_8x2, 6,6,4, pSrc0, src0, src1, dst, src0Stride, src1tride, dstStride + movu xm0, [r0] + vinserti128 m0, m0, [r0 + 2 * r3], 1 + + movu xm2, [r1] + vinserti128 m2, m2, [r1 + 2 * r4], 1 + + paddw m0, m2 + pmulhrsw m0, [pw_256] + paddw m0, [pw_128] + + packuswb m0, m0 + vextracti128 xm1, m0, 1 + movq [r2], xm0 + movq [r2 + r5], xm1 + RET + +cglobal addAvg_8x6, 6,6,6, pSrc0, src0, src1, dst, src0Stride, src1tride, dstStride + mova m4, [pw_256] + mova m5, [pw_128] + add r3, r3 + add r4, r4 + + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + + movu xm2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m4 + paddw m0, m5 + + packuswb m0, m0 + vextracti128 xm1, m0, 1 + movq [r2], xm0 + movq [r2 + r5], xm1 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + movu xm0, [r0] + vinserti128 m0, m0, [r0+ r3], 1 + + movu xm2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m4 + paddw m0, m5 + + packuswb m0, m0 + vextracti128 xm1, m0, 1 + movq [r2], xm0 + movq [r2 + r5], xm1 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + + movu xm2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m4 + paddw m0, m5 + + packuswb m0, m0 + vextracti128 xm1, m0, 1 + movq [r2], xm0 + movq [r2 + r5], xm1 + RET + +%macro ADDAVG_W8_H4_AVX2 1 +INIT_YMM avx2 +cglobal addAvg_8x%1, 6,7,6, pSrc0, src0, src1, dst, src0Stride, src1tride, dstStride + mova m4, [pw_256] + mova m5, [pw_128] + add r3, r3 + add r4, r4 + mov r6d, %1/4 + +.loop: + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + + movu xm2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m4 + 
paddw m0, m5 + + packuswb m0, m0 + vextracti128 xm1, m0, 1 + movq [r2], xm0 + movq [r2 + r5], xm1 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + + movu m2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m4 + paddw m0, m5 + + packuswb m0, m0 + vextracti128 xm1, m0, 1 + movq [r2], xm0 + movq [r2 + r5], xm1 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + dec r6d + jnz .loop + RET +%endmacro +ADDAVG_W8_H4_AVX2 4 +ADDAVG_W8_H4_AVX2 8 +ADDAVG_W8_H4_AVX2 16 +ADDAVG_W8_H4_AVX2 32 + +%macro ADDAVG_W12_H4_AVX2 1 +INIT_YMM avx2 +cglobal addAvg_12x%1, 6,7,7, pSrc0, src0, src1, dst, src0Stride, src1tride, dstStride + mova m4, [pw_256] + mova m5, [pw_128] + add r3, r3 + add r4, r4 + mov r6d, %1/4 + +.loop: + movu xm0, [r0] + movu xm1, [r1] + movq xm2, [r0 + 16] + movq xm3, [r1 + 16] + vinserti128 m0, m0, xm2, 1 + vinserti128 m1, m1, xm3, 1 + + paddw m0, m1 + pmulhrsw m0, m4 + paddw m0, m5 + + movu xm1, [r0 + r3] + movu xm2, [r1 + r4] + movq xm3, [r0 + r3 + 16] + movq xm6, [r1 + r3 + 16] + vinserti128 m1, m1, xm3, 1 + vinserti128 m2, m2, xm6, 1 + + paddw m1, m2 + pmulhrsw m1, m4 + paddw m1, m5 + + packuswb m0, m1 + vextracti128 xm1, m0, 1 + movq [r2], xm0 + movd [r2 + 8], xm1 + vpshufd m1, m1, 2 + movhps [r2 + r5], xm0 + movd [r2 + r5 + 8], xm1 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + movu xm0, [r0] + movu xm1, [r1] + movq xm2, [r0 + 16] + movq xm3, [r1 + 16] + vinserti128 m0, m0, xm2, 1 + vinserti128 m1, m1, xm3, 1 + + paddw m0, m1 + pmulhrsw m0, m4 + paddw m0, m5 + + movu xm1, [r0 + r3] + movu xm2, [r1 + r4]
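For orientation, the per-pixel arithmetic that the 8-bit AVX2 addAvg kernels in this hunk appear to implement can be written in scalar C++. This is a sketch read off the instruction sequence (paddw, pmulhrsw by pw_256, paddw pw_128, packuswb), not code taken from x265. The name addAvgRef and the explicit width/height parameters are assumptions, and the sketch assumes that the 16-bit sum does not wrap and that >> on negative values is an arithmetic shift.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of the 8-bit addAvg kernels.
// pmulhrsw by 256 is a rounded shift right by 7: (sum + 64) >> 7.
// Strides are counted in elements, not bytes (the assembly doubles r3/r4
// because the sources are int16_t).
void addAvgRef(const int16_t* src0, const int16_t* src1, uint8_t* dst,
               intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride,
               int width, int height)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = src0[x] + src1[x];       // paddw
            int v = ((sum + 64) >> 7) + 128;   // pmulhrsw by 256, then paddw 128
            dst[x] = static_cast<uint8_t>(std::min(255, std::max(0, v))); // packuswb
        }
        src0 += src0Stride;
        src1 += src1Stride;
        dst += dstStride;
    }
}
```

A scalar model like this is mainly useful as a comparison target when checking a SIMD kernel on random input.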
x265_1.5.tar.gz/source/common/x86/pixel-a.asm -> x265_1.6.tar.gz/source/common/x86/pixel-a.asm
Changed
@@ -38,13 +38,15 @@ times 4 db 1, -1 times 8 db 1 times 4 db 1, -1 -hmul_4p: times 2 db 1, 1, 1, 1, 1, -1, 1, -1 +hmul_4p: times 4 db 1, 1, 1, 1, 1, -1, 1, -1 mask_10: times 4 dw 0, -1 mask_1100: times 2 dd 0, -1 hmul_8w: times 4 dw 1 times 2 dw 1, -1 + times 4 dw 1 + times 2 dw 1, -1 ALIGN 32 -hmul_w: dw 1, -1, 1, -1, 1, -1, 1, -1 +hmul_w: times 2 dw 1, -1, 1, -1, 1, -1, 1, -1 ALIGN 32 transd_shuf1: SHUFFLE_MASK_W 0, 8, 2, 10, 4, 12, 6, 14 transd_shuf2: SHUFFLE_MASK_W 1, 9, 3, 11, 5, 13, 7, 15 @@ -1235,6 +1237,580 @@ RET %else +%if WIN64 +cglobal pixel_satd_16x24, 4,8,14 ;if WIN64 && cpuflag(avx) + SATD_START_SSE2 m6, m7 + mov r6, r0 + mov r7, r2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 8*SIZEOF_PIXEL] + lea r2, [r7 + 8*SIZEOF_PIXEL] + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + pxor m7, m7 + movhlps m7, m6 + paddd m6, m7 + pshufd m7, m6, 1 + paddd m6, m7 + movd eax, m6 + RET +%else +cglobal pixel_satd_16x24, 4,7,8,0-gprsize ;if !WIN64 + SATD_START_SSE2 m6, m7 + mov r6, r0 + mov [rsp], r2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 8*SIZEOF_PIXEL] + mov r2, [rsp] + add r2, 8*SIZEOF_PIXEL + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + pxor m7, m7 + movhlps m7, m6 + paddd m6, m7 + pshufd m7, m6, 1 + paddd m6, m7 + movd eax, m6 + RET +%endif +%if WIN64 +cglobal pixel_satd_32x48, 4,8,14 ;if WIN64 && cpuflag(avx) + SATD_START_SSE2 m6, m7 + mov r6, r0 + mov r7, r2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 8*SIZEOF_PIXEL] + lea r2, [r7 + 8*SIZEOF_PIXEL] + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call 
pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 16*SIZEOF_PIXEL] + lea r2, [r7 + 16*SIZEOF_PIXEL] + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 24*SIZEOF_PIXEL] + lea r2, [r7 + 24*SIZEOF_PIXEL] + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + pxor m7, m7 + movhlps m7, m6 + paddd m6, m7 + pshufd m7, m6, 1 + paddd m6, m7 + movd eax, m6 + RET +%else +cglobal pixel_satd_32x48, 4,7,8,0-gprsize ;if !WIN64 + SATD_START_SSE2 m6, m7 + mov r6, r0 + mov [rsp], r2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 8*SIZEOF_PIXEL] + mov r2, [rsp] + add r2, 8*SIZEOF_PIXEL + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 16*SIZEOF_PIXEL] + mov r2, [rsp] + add r2, 16*SIZEOF_PIXEL + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 24*SIZEOF_PIXEL] + mov r2, [rsp] + add r2, 24*SIZEOF_PIXEL + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + pxor m7, m7 + movhlps m7, m6 + paddd m6, m7 + pshufd m7, m6, 1 + paddd m6, m7 + movd eax, m6 + RET +%endif + +%if WIN64 +cglobal pixel_satd_24x64, 4,8,14 ;if WIN64 && cpuflag(avx) + 
SATD_START_SSE2 m6, m7 + mov r6, r0 + mov r7, r2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 8*SIZEOF_PIXEL] + lea r2, [r7 + 8*SIZEOF_PIXEL] + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + lea r0, [r6 + 16*SIZEOF_PIXEL] + lea r2, [r7 + 16*SIZEOF_PIXEL] + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2 + pxor m7, m7 + movhlps m7, m6 + paddd m6, m7 + pshufd m7, m6, 1 + paddd m6, m7 + movd eax, m6 + RET +%else +cglobal pixel_satd_24x64, 4,7,8,0-gprsize ;if !WIN64 + SATD_START_SSE2 m6, m7 + mov r6, r0 + mov [rsp], r2 + call pixel_satd_8x8_internal2 + call pixel_satd_8x8_internal2
x265_1.5.tar.gz/source/common/x86/pixel-util.h -> x265_1.6.tar.gz/source/common/x86/pixel-util.h
Changed
@@ -30,6 +30,8 @@
 void x265_getResidual16_sse4(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
 void x265_getResidual32_sse2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
 void x265_getResidual32_sse4(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
+void x265_getResidual16_avx2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
+void x265_getResidual32_avx2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
 void x265_transpose4_sse2(pixel* dest, const pixel* src, intptr_t stride);
 void x265_transpose8_sse2(pixel* dest, const pixel* src, intptr_t stride);
@@ -48,7 +50,15 @@
 uint32_t x265_nquant_avx2(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff);
 void x265_dequant_normal_sse4(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
 void x265_dequant_normal_avx2(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
-int x265_count_nonzero_ssse3(const int16_t* quantCoeff, int numCoeff);
+
+int x265_count_nonzero_4x4_ssse3(const int16_t* quantCoeff);
+int x265_count_nonzero_8x8_ssse3(const int16_t* quantCoeff);
+int x265_count_nonzero_16x16_ssse3(const int16_t* quantCoeff);
+int x265_count_nonzero_32x32_ssse3(const int16_t* quantCoeff);
+int x265_count_nonzero_4x4_avx2(const int16_t* quantCoeff);
+int x265_count_nonzero_8x8_avx2(const int16_t* quantCoeff);
+int x265_count_nonzero_16x16_avx2(const int16_t* quantCoeff);
+int x265_count_nonzero_32x32_avx2(const int16_t* quantCoeff);
 void x265_weight_pp_sse4(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
 void x265_weight_pp_avx2(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
@@ -67,6 +77,8 @@
 void x265_scale1D_128to64_avx2(pixel*, const pixel*, intptr_t);
 void x265_scale2D_64to32_ssse3(pixel*, const pixel*, intptr_t);
+int x265_findPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+
 #define SETUP_CHROMA_PIXELSUB_PS_FUNC(W, H, cpu) \
     void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t* dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \
     void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* scr1, intptr_t srcStride0, intptr_t srcStride1);
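The header change above replaces a single count_nonzero(quantCoeff, numCoeff) entry point with one function per block size, so the coefficient count is implied by the function name. A minimal scalar sketch of what such a per-size function presumably computes follows. The name countNonzeroRef and the log2Size parameter are hypothetical, and the sketch assumes the coefficients of a block are stored contiguously.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference: counts non-zero coefficients in a square
// block. log2Size of 2, 3, 4, 5 corresponds to 4x4, 8x8, 16x16, 32x32.
int countNonzeroRef(const int16_t* quantCoeff, int log2Size)
{
    assert(log2Size >= 2 && log2Size <= 5);
    const int numCoeff = 1 << (2 * log2Size); // size * size coefficients
    int count = 0;
    for (int i = 0; i < numCoeff; i++)
        count += (quantCoeff[i] != 0) ? 1 : 0;
    return count;
}
```

Fixing the size per function lets a SIMD version unroll fully instead of looping on a runtime count.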
x265_1.5.tar.gz/source/common/x86/pixel-util8.asm -> x265_1.6.tar.gz/source/common/x86/pixel-util8.asm
Changed
@@ -3,6 +3,7 @@ ;* ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com> ;* Nabajit Deka <nabajit@multicorewareinc.com> +;* Rajesh Paulraj <rajesh@multicorewareinc.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -63,6 +64,12 @@ cextern pd_1 cextern pd_32767 cextern pd_n32768 +cextern pb_2 +cextern pb_4 +cextern pb_8 +cextern pb_16 +cextern pb_32 +cextern pb_64 ;----------------------------------------------------------------------------- ; void getResidual(pixel *fenc, pixel *pred, int16_t *residual, intptr_t stride) @@ -95,9 +102,9 @@ punpcklqdq m0, m1 punpcklqdq m2, m3 psubw m0, m2 - movh [r2], m0 movhps [r2 + r3], m0 + RET %else cglobal getResidual4, 4,4,5 pxor m0, m0 @@ -130,8 +137,8 @@ psubw m1, m3 movh [r2], m1 movhps [r2 + r3 * 2], m1 -%endif RET +%endif INIT_XMM sse2 @@ -157,6 +164,7 @@ lea r2, [r2 + r3 * 2] %endif %endrep + RET %else cglobal getResidual8, 4,4,5 pxor m0, m0 @@ -183,8 +191,9 @@ lea r2, [r2 + r3 * 4] %endif %endrep -%endif RET +%endif + %if HIGH_BIT_DEPTH INIT_XMM sse2 @@ -238,10 +247,9 @@ lea r0, [r0 + r3 * 2] lea r1, [r1 + r3 * 2] lea r2, [r2 + r3 * 2] - jnz .loop + RET %else - INIT_XMM sse4 cglobal getResidual16, 4,5,8 mov r4d, 16/4 @@ -302,11 +310,67 @@ lea r0, [r0 + r3 * 2] lea r1, [r1 + r3 * 2] lea r2, [r2 + r3 * 4] - jnz .loop + RET %endif +%if HIGH_BIT_DEPTH +INIT_YMM avx2 +cglobal getResidual16, 4,4,5 + add r3, r3 + pxor m0, m0 + +%assign x 0 +%rep 16/2 + movu m1, [r0] + movu m2, [r0 + r3] + movu m3, [r1] + movu m4, [r1 + r3] + + psubw m1, m3 + psubw m2, m4 + movu [r2], m1 + movu [r2 + r3], m2 +%assign x x+1 +%if (x != 8) + lea r0, [r0 + r3 * 2] + lea r1, [r1 + r3 * 2] + lea r2, [r2 + r3 * 2] +%endif +%endrep RET +%else +INIT_YMM avx2 +cglobal getResidual16, 4,5,8 + lea r4, [r3 * 2] + add r4d, r3d +%assign x 0 +%rep 4 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r3] + pmovzxbw m2, [r0 + r3 * 2] + pmovzxbw m3, [r0 + 
r4] + pmovzxbw m4, [r1] + pmovzxbw m5, [r1 + r3] + pmovzxbw m6, [r1 + r3 * 2] + pmovzxbw m7, [r1 + r4] + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + movu [r2], m0 + movu [r2 + r3 * 2], m1 + movu [r2 + r3 * 2 * 2], m2 + movu [r2 + r4 * 2], m3 +%assign x x+1 +%if (x != 4) + lea r0, [r0 + r3 * 2 * 2] + lea r1, [r1 + r3 * 2 * 2] + lea r2, [r2 + r3 * 4 * 2] +%endif +%endrep + RET +%endif %if HIGH_BIT_DEPTH INIT_XMM sse2 @@ -357,9 +421,8 @@ lea r0, [r0 + r3 * 2] lea r1, [r1 + r3 * 2] lea r2, [r2 + r3 * 2] - jnz .loop - + RET %else INIT_XMM sse4 cglobal getResidual32, 4,5,7 @@ -415,12 +478,70 @@ lea r0, [r0 + r3 * 2] lea r1, [r1 + r3 * 2] lea r2, [r2 + r3 * 4] - jnz .loop + RET +%endif + + +%if HIGH_BIT_DEPTH +INIT_YMM avx2 +cglobal getResidual32, 4,4,5 + add r3, r3 + pxor m0, m0 + +%assign x 0 +%rep 32 + movu m1, [r0] + movu m2, [r0 + 32] + movu m3, [r1] + movu m4, [r1 + 32] + + psubw m1, m3 + psubw m2, m4 + movu [r2], m1 + movu [r2 + 32], m2 +%assign x x+1 +%if (x != 32) + lea r0, [r0 + r3] + lea r1, [r1 + r3] + lea r2, [r2 + r3] %endif +%endrep RET +%else +INIT_YMM avx2 +cglobal getResidual32, 4,5,8 + lea r4, [r3 * 2] +%assign x 0 +%rep 16 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + 16] + pmovzxbw m2, [r0 + r3] + pmovzxbw m3, [r0 + r3 + 16] + + pmovzxbw m4, [r1]
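The getResidual kernels in this file follow the prototype given in the assembly comment, void getResidual(pixel *fenc, pixel *pred, int16_t *residual, intptr_t stride). As a rough scalar equivalent for the 8-bit case, assuming one stride (in elements) shared by all three buffers as the assembly suggests, one could write the following. The name getResidualRef and the size parameter are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for getResidual at 8-bit depth:
// residual = fenc - pred for a size x size block.
void getResidualRef(const uint8_t* fenc, const uint8_t* pred,
                    int16_t* residual, intptr_t stride, int size)
{
    assert(size > 0 && stride >= size);
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
            residual[x] = static_cast<int16_t>(fenc[x] - pred[x]); // pmovzxbw + psubw
        fenc += stride;
        pred += stride;
        residual += stride;
    }
}
```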
x265_1.5.tar.gz/source/common/x86/pixel.h -> x265_1.6.tar.gz/source/common/x86/pixel.h
Changed
@@ -103,6 +103,13 @@ DECL_X1(satd, avx) DECL_X1(satd, xop) DECL_X1(satd, avx2) +int x265_pixel_satd_16x24_avx(const pixel*, intptr_t, const pixel*, intptr_t); +int x265_pixel_satd_32x48_avx(const pixel*, intptr_t, const pixel*, intptr_t); +int x265_pixel_satd_24x64_avx(const pixel*, intptr_t, const pixel*, intptr_t); +int x265_pixel_satd_8x64_avx(const pixel*, intptr_t, const pixel*, intptr_t); +int x265_pixel_satd_8x12_avx(const pixel*, intptr_t, const pixel*, intptr_t); +int x265_pixel_satd_12x32_avx(const pixel*, intptr_t, const pixel*, intptr_t); +int x265_pixel_satd_4x32_avx(const pixel*, intptr_t, const pixel*, intptr_t); int x265_pixel_satd_8x32_sse2(const pixel*, intptr_t, const pixel*, intptr_t); int x265_pixel_satd_16x4_sse2(const pixel*, intptr_t, const pixel*, intptr_t); int x265_pixel_satd_16x12_sse2(const pixel*, intptr_t, const pixel*, intptr_t); @@ -170,10 +177,12 @@ int x265_pixel_ssd_s_8_sse2(const int16_t*, intptr_t); int x265_pixel_ssd_s_16_sse2(const int16_t*, intptr_t); int x265_pixel_ssd_s_32_sse2(const int16_t*, intptr_t); +int x265_pixel_ssd_s_16_avx2(const int16_t*, intptr_t); int x265_pixel_ssd_s_32_avx2(const int16_t*, intptr_t); #define ADDAVG(func) \ - void x265_ ## func ## _sse4(const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); + void x265_ ## func ## _sse4(const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \ + void x265_ ## func ## _avx2(const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); ADDAVG(addAvg_2x4) ADDAVG(addAvg_2x8) ADDAVG(addAvg_4x2); @@ -228,6 +237,41 @@ int x265_psyCost_ss_16x16_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); int x265_psyCost_ss_32x32_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); int x265_psyCost_ss_64x64_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); +void x265_pixel_avg_16x4_avx2(pixel* dst, intptr_t dstride, const pixel* 
src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_16x8_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_16x12_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_16x16_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_16x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_16x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_32x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_32x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_32x24_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_32x16_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_32x8_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_64x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_64x48_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_64x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_64x16_avx2(pixel* dst, intptr_t 
dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); + +void x265_pixel_add_ps_16x16_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_32x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_64x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); + +void x265_pixel_sub_ps_16x16_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_32x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_64x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); + +int x265_psyCost_pp_4x4_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); +int x265_psyCost_pp_8x8_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); +int x265_psyCost_pp_16x16_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); +int x265_psyCost_pp_32x32_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); +int x265_psyCost_pp_64x64_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); + +int x265_psyCost_ss_4x4_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); +int x265_psyCost_ss_8x8_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); +int x265_psyCost_ss_16x16_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); +int x265_psyCost_ss_32x32_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); +int x265_psyCost_ss_64x64_avx2(const int16_t* source, intptr_t sstride, 
const int16_t* recon, intptr_t rstride); #undef DECL_PIXELS #undef DECL_HEVC_SSD
x265_1.5.tar.gz/source/common/x86/pixeladd8.asm -> x265_1.6.tar.gz/source/common/x86/pixeladd8.asm
Changed
@@ -398,6 +398,52 @@ jnz .loop RET + +INIT_YMM avx2 +cglobal pixel_add_ps_16x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1 + mov r6d, %2/4 + add r5, r5 +.loop: + + pmovzxbw m0, [r2] ; row 0 of src0 + pmovzxbw m1, [r2 + r4] ; row 1 of src0 + movu m2, [r3] ; row 0 of src1 + movu m3, [r3 + r5] ; row 1 of src1 + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + + lea r2, [r2 + r4 * 2] + lea r3, [r3 + r5 * 2] + + pmovzxbw m2, [r2] ; row 2 of src0 + pmovzxbw m3, [r2 + r4] ; row 3 of src0 + movu m4, [r3] ; row 2 of src1 + movu m5, [r3 + r5] ; row 3 of src1 + paddw m2, m4 + paddw m3, m5 + packuswb m2, m3 + + lea r2, [r2 + r4 * 2] + lea r3, [r3 + r5 * 2] + + vpermq m0, m0, 11011000b + movu [r0], xm0 ; row 0 of dst + vextracti128 xm3, m0, 1 + movu [r0 + r1], xm3 ; row 1 of dst + + lea r0, [r0 + r1 * 2] + vpermq m2, m2, 11011000b + movu [r0], xm2 ; row 2 of dst + vextracti128 xm3, m2, 1 + movu [r0 + r1], xm3 ; row 3 of dst + + lea r0, [r0 + r1 * 2] + + dec r6d + jnz .loop + + RET %endif %endmacro @@ -523,6 +569,67 @@ jnz .loop RET + +INIT_YMM avx2 +cglobal pixel_add_ps_32x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1 + mov r6d, %2/4 + add r5, r5 +.loop: + pmovzxbw m0, [r2] ; first half of row 0 of src0 + pmovzxbw m1, [r2 + 16] ; second half of row 0 of src0 + movu m2, [r3] ; first half of row 0 of src1 + movu m3, [r3 + 32] ; second half of row 0 of src1 + + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m0, 11011000b + movu [r0], m0 ; row 0 of dst + + pmovzxbw m0, [r2 + r4] ; first half of row 1 of src0 + pmovzxbw m1, [r2 + r4 + 16] ; second half of row 1 of src0 + movu m2, [r3 + r5] ; first half of row 1 of src1 + movu m3, [r3 + r5 + 32] ; second half of row 1 of src1 + + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m0, 11011000b + movu [r0 + r1], m0 ; row 1 of dst + + lea r2, [r2 + r4 * 2] + lea r3, [r3 + r5 * 2] + lea r0, [r0 + r1 * 2] + + pmovzxbw m0, [r2] ; first half of row 2 of src0 + pmovzxbw m1, [r2 + 16] ; 
second half of row 2 of src0 + movu m2, [r3] ; first half of row 2 of src1 + movu m3, [r3 + 32] ; second half of row 2 of src1 + + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m0, 11011000b + movu [r0], m0 ; row 2 of dst + + pmovzxbw m0, [r2 + r4] ; first half of row 3 of src0 + pmovzxbw m1, [r2 + r4 + 16] ; second half of row 3 of src0 + movu m2, [r3 + r5] ; first half of row 3 of src1 + movu m3, [r3 + r5 + 32] ; second half of row 3 of src1 + + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m0, 11011000b + movu [r0 + r1], m0 ; row 3 of dst + + lea r2, [r2 + r4 * 2] + lea r3, [r3 + r5 * 2] + lea r0, [r0 + r1 * 2] + + dec r6d + jnz .loop + RET %endif %endmacro @@ -734,6 +841,60 @@ jnz .loop RET + +INIT_YMM avx2 +cglobal pixel_add_ps_64x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1 + mov r6d, %2/2 + add r5, r5 +.loop: + pmovzxbw m0, [r2] ; first 16 of row 0 of src0 + pmovzxbw m1, [r2 + 16] ; second 16 of row 0 of src0 + pmovzxbw m2, [r2 + 32] ; third 16 of row 0 of src0 + pmovzxbw m3, [r2 + 48] ; forth 16 of row 0 of src0 + movu m4, [r3] ; first 16 of row 0 of src1 + movu m5, [r3 + 32] ; second 16 of row 0 of src1 + movu m6, [r3 + 64] ; third 16 of row 0 of src1 + movu m7, [r3 + 96] ; forth 16 of row 0 of src1 + + paddw m0, m4 + paddw m1, m5 + paddw m2, m6 + paddw m3, m7 + packuswb m0, m1 + packuswb m2, m3 + vpermq m0, m0, 11011000b + movu [r0], m0 ; first 32 of row 0 of dst + vpermq m2, m2, 11011000b + movu [r0 + 32], m2 ; second 32 of row 0 of dst + + pmovzxbw m0, [r2 + r4] ; first 16 of row 1 of src0 + pmovzxbw m1, [r2 + r4 + 16] ; second 16 of row 1 of src0 + pmovzxbw m2, [r2 + r4 + 32] ; third 16 of row 1 of src0 + pmovzxbw m3, [r2 + r4 + 48] ; forth 16 of row 1 of src0 + movu m4, [r3 + r5] ; first 16 of row 1 of src1 + movu m5, [r3 + r5 + 32] ; second 16 of row 1 of src1 + movu m6, [r3 + r5 + 64] ; third 16 of row 1 of src1 + movu m7, [r3 + r5 + 96] ; forth 16 of row 1 of src1 + + paddw m0, m4 + paddw m1, m5 + paddw 
m2, m6 + paddw m3, m7 + packuswb m0, m1 + packuswb m2, m3 + vpermq m0, m0, 11011000b + movu [r0 + r1], m0 ; first 32 of row 1 of dst + vpermq m2, m2, 11011000b + movu [r0 + r1 + 32], m2 ; second 32 of row 1 of dst + + lea r2, [r2 + r4 * 2] + lea r3, [r3 + r5 * 2] + lea r0, [r0 + r1 * 2] + + dec r6d + jnz .loop + RET + %endif %endmacro
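The three AVX2 pixel_add_ps variants above share one pattern: widen the 8-bit prediction (pmovzxbw), add the 16-bit residual (paddw), and saturate back to 8 bits (packuswb, followed by vpermq to undo the lane interleave). A scalar sketch of that arithmetic, with argument order modelled on the prototype in pixel.h, might look like this. The name pixelAddPsRef and the width/height parameters are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for pixel_add_ps at 8-bit depth:
// dst = clamp(src0 + src1, 0, 255). Strides are in elements.
void pixelAddPsRef(uint8_t* dst, intptr_t dstStride,
                   const uint8_t* src0, const int16_t* src1,
                   intptr_t srcStride0, intptr_t srcStride1,
                   int width, int height)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int v = src0[x] + src1[x];
            dst[x] = static_cast<uint8_t>(std::min(255, std::max(0, v)));
        }
        dst += dstStride;
        src0 += srcStride0;
        src1 += srcStride1;
    }
}
```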
x265_1.5.tar.gz/source/common/x86/sad-a.asm -> x265_1.6.tar.gz/source/common/x86/sad-a.asm
Changed
@@ -3710,3 +3710,749 @@ SADX34_CACHELINE_FUNC 16, 16, 64, sse2, ssse3, ssse3 SADX34_CACHELINE_FUNC 16, 8, 64, sse2, ssse3, ssse3 +%if HIGH_BIT_DEPTH==0 +INIT_YMM avx2 +cglobal pixel_sad_x3_8x4, 6,6,5 + xorps m0, m0 + xorps m1, m1 + + sub r2, r1 ; rebase on pointer r1 + sub r3, r1 + + ; row 0 + vpbroadcastq xm2, [r0 + 0 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + add r1, r4 + + ; row 1 + vpbroadcastq xm2, [r0 + 1 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + add r1, r4 + + ; row 2 + vpbroadcastq xm2, [r0 + 2 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + add r1, r4 + + ; row 3 + vpbroadcastq xm2, [r0 + 3 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + + pshufd xm0, xm0, q0020 + movq [r5 + 0], xm0 + movd [r5 + 8], xm1 + RET + +INIT_YMM avx2 +cglobal pixel_sad_x3_8x8, 6,6,5 + xorps m0, m0 + xorps m1, m1 + + sub r2, r1 ; rebase on pointer r1 + sub r3, r1 +%assign x 0 +%rep 4 + ; row 0 + vpbroadcastq xm2, [r0 + 0 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + add r1, r4 + + ; row 1 + vpbroadcastq xm2, [r0 + 1 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + +%assign x x+1 + %if x < 4 + add r1, r4 + add r0, 2 * FENC_STRIDE + %endif +%endrep + + pshufd xm0, xm0, q0020 + movq [r5 + 0], xm0 + movd [r5 + 8], xm1 + RET + +INIT_YMM avx2 +cglobal pixel_sad_x3_8x16, 6,6,5 + xorps m0, m0 + xorps m1, m1 + + sub r2, r1 ; rebase on pointer r1 + sub r3, r1 +%assign x 0 +%rep 8 
+ ; row 0 + vpbroadcastq xm2, [r0 + 0 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + add r1, r4 + + ; row 1 + vpbroadcastq xm2, [r0 + 1 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + +%assign x x+1 + %if x < 8 + add r1, r4 + add r0, 2 * FENC_STRIDE + %endif +%endrep + + pshufd xm0, xm0, q0020 + movq [r5 + 0], xm0 + movd [r5 + 8], xm1 + RET + +INIT_YMM avx2 +cglobal pixel_sad_x4_8x8, 7,7,5 + xorps m0, m0 + xorps m1, m1 + + sub r2, r1 ; rebase on pointer r1 + sub r3, r1 + sub r4, r1 +%assign x 0 +%rep 4 + ; row 0 + vpbroadcastq xm2, [r0 + 0 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + movhps xm4, [r1 + r4] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + add r1, r5 + + ; row 1 + vpbroadcastq xm2, [r0 + 1 * FENC_STRIDE] + movq xm3, [r1] + movhps xm3, [r1 + r2] + movq xm4, [r1 + r3] + movhps xm4, [r1 + r4] + psadbw xm3, xm2 + psadbw xm4, xm2 + paddd xm0, xm3 + paddd xm1, xm4 + +%assign x x+1 + %if x < 4 + add r1, r5 + add r0, 2 * FENC_STRIDE + %endif +%endrep + + pshufd xm0, xm0, q0020 + pshufd xm1, xm1, q0020 + movq [r6 + 0], xm0 + movq [r6 + 8], xm1 + RET + +INIT_YMM avx2 +cglobal pixel_sad_32x8, 4,4,6 + xorps m0, m0 + xorps m5, m5 + + movu m1, [r0] ; row 0 of pix0 + movu m2, [r2] ; row 0 of pix1 + movu m3, [r0 + r1] ; row 1 of pix0 + movu m4, [r2 + r3] ; row 1 of pix1
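The pixel_sad_x3 kernels above compute the sum of absolute differences between one source block and three candidate reference blocks in a single pass, writing three sums to the result pointer. A scalar sketch of that computation is given below. The name sadX3Ref is hypothetical, and fencStride is passed as a parameter here, whereas the assembly uses the compile-time constant FENC_STRIDE.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical scalar reference for sad_x3: one SAD per reference block.
void sadX3Ref(const uint8_t* fenc, intptr_t fencStride,
              const uint8_t* ref0, const uint8_t* ref1, const uint8_t* ref2,
              intptr_t refStride, int width, int height, int32_t res[3])
{
    assert(width > 0 && height > 0);
    const uint8_t* refs[3] = { ref0, ref1, ref2 };
    for (int i = 0; i < 3; i++)
    {
        const uint8_t* f = fenc;
        const uint8_t* r = refs[i];
        int32_t sum = 0;
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
                sum += std::abs(f[x] - r[x]); // psadbw accumulates these per row
            f += fencStride;
            r += refStride;
        }
        res[i] = sum;
    }
}
```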
x265_1.5.tar.gz/source/common/x86/ssd-a.asm -> x265_1.6.tar.gz/source/common/x86/ssd-a.asm
Changed
@@ -822,10 +822,10 @@
 %if HIGH_BIT_DEPTH == 0
 %macro SSD_LOAD_FULL 5
-    mova m1, [t0+%1]
-    mova m2, [t2+%2]
-    mova m3, [t0+%3]
-    mova m4, [t2+%4]
+    movu m1, [t0+%1]
+    movu m2, [t2+%2]
+    movu m3, [t0+%3]
+    movu m4, [t2+%4]
 %if %5==1
     add t0, t1
     add t2, t3
@@ -1094,6 +1094,8 @@
 INIT_YMM avx2
 SSD 16, 16
 SSD 16, 8
+SSD 32, 32
+SSD 64, 64
 %assign function_align 16
 %endif ; !HIGH_BIT_DEPTH
@@ -2548,6 +2550,35 @@
     movd eax, m0
     RET
+INIT_YMM avx2
+cglobal pixel_ssd_s_16, 2,4,5
+    add r1, r1
+    lea r3, [r1 * 3]
+    mov r2d, 16/4
+    pxor m0, m0
+.loop:
+    movu m1, [r0]
+    movu m2, [r0 + r1]
+    movu m3, [r0 + 2 * r1]
+    movu m4, [r0 + r3]
+
+    lea r0, [r0 + r1 * 4]
+    pmaddwd m1, m1
+    pmaddwd m2, m2
+    pmaddwd m3, m3
+    pmaddwd m4, m4
+    paddd m1, m2
+    paddd m3, m4
+    paddd m1, m3
+    paddd m0, m1
+
+    dec r2d
+    jnz .loop
+
+    ; calculate sum and return
+    HADDD m0, m1
+    movd eax, xm0
+    RET
 INIT_YMM avx2
 cglobal pixel_ssd_s_32, 2,4,5
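The new pixel_ssd_s_16 kernel squares each 16-bit sample and sums adjacent pairs with pmaddwd, accumulates four rows per loop iteration, and reduces the accumulator horizontally at the end. In scalar terms that is a plain sum of squares over the block, sketched below. The name ssdSRef and the size parameter are hypothetical, and the stride is taken in elements.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for pixel_ssd_s: sum of squared int16_t
// samples over a size x size block.
int ssdSRef(const int16_t* src, intptr_t stride, int size)
{
    assert(size > 0);
    int sum = 0;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
            sum += src[x] * src[x]; // pmaddwd m, m squares and pairs these
        src += stride;
    }
    return sum;
}
```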
x265_1.5.tar.gz/source/encoder/analysis.cpp -> x265_1.6.tar.gz/source/encoder/analysis.cpp
Changed
@@ -71,9 +71,10 @@ Analysis::Analysis() { - m_totalNumJobs = m_numAcquiredJobs = m_numCompletedJobs = 0; m_reuseIntraDataCTU = NULL; m_reuseInterDataCTU = NULL; + m_reuseRef = NULL; + m_reuseBestMergeCand = NULL; } bool Analysis::create(ThreadLocalData *tld) @@ -125,6 +126,11 @@ m_slice = ctu.m_slice; m_frame = &frame; +#if _DEBUG || CHECKED_BUILD + for (uint32_t i = 0; i <= g_maxCUDepth; i++) + for (uint32_t j = 0; j < MAX_PRED_TYPES; j++) + m_modeDepth[i].pred[j].invalidate(); +#endif invalidateContexts(0); m_quant.setQPforQuant(ctu); m_rqt[0].cur.load(initialContext); @@ -139,10 +145,13 @@ { int numPredDir = m_slice->isInterP() ? 1 : 2; m_reuseInterDataCTU = (analysis_inter_data *)m_frame->m_analysisData.interData; - reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; + m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; + m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS]; } } + ProfileCUScope(ctu, totalCTUTime, totalCTUs); + uint32_t zOrder = 0; if (m_slice->m_sliceType == I_SLICE) { @@ -153,6 +162,7 @@ memcpy(&m_reuseIntraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition); memcpy(&m_reuseIntraDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition); memcpy(&m_reuseIntraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition); + memcpy(&m_reuseIntraDataCTU->chromaModes[ctu.m_cuAddr * numPartition], bestCU->m_chromaIntraDir, sizeof(uint8_t) * numPartition); } } else @@ -196,14 +206,16 @@ return; else if (md.bestMode->cu.isIntra(0)) { + md.pred[PRED_LOSSLESS].initCosts(); md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom); PartSize size = (PartSize)md.pred[PRED_LOSSLESS].cu.m_partSize[0]; uint8_t* modes = md.pred[PRED_LOSSLESS].cu.m_lumaIntraDir; - checkIntra(md.pred[PRED_LOSSLESS], cuGeom, 
size, modes); + checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes, NULL); checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth); } else { + md.pred[PRED_LOSSLESS].initCosts(); md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom); md.pred[PRED_LOSSLESS].predYuv.copyFromYuv(md.bestMode->predYuv); encodeResAndCalcRdInterCU(md.pred[PRED_LOSSLESS], cuGeom); @@ -225,15 +237,16 @@ uint8_t* reuseDepth = &m_reuseIntraDataCTU->depth[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; uint8_t* reuseModes = &m_reuseIntraDataCTU->modes[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; char* reusePartSizes = &m_reuseIntraDataCTU->partSizes[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; + uint8_t* reuseChromaModes = &m_reuseIntraDataCTU->chromaModes[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; - if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.encodeIdx) + if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx) { m_quant.setQPforQuant(parentCTU); PartSize size = (PartSize)reusePartSizes[zOrder]; Mode& mode = size == SIZE_2Nx2N ? 
md.pred[PRED_INTRA] : md.pred[PRED_INTRA_NxN]; mode.cu.initSubCU(parentCTU, cuGeom); - checkIntra(mode, cuGeom, size, &reuseModes[zOrder]); + checkIntra(mode, cuGeom, size, &reuseModes[zOrder], &reuseChromaModes[zOrder]); checkBestMode(mode, depth); if (m_bTryLossless) @@ -252,13 +265,13 @@ m_quant.setQPforQuant(parentCTU); md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom); - checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL); + checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL); checkBestMode(md.pred[PRED_INTRA], depth); - if (depth == g_maxCUDepth) + if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3) { md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom); - checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL); + checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL); checkBestMode(md.pred[PRED_INTRA_NxN], depth); } @@ -286,7 +299,7 @@ const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx); if (childGeom.flags & CUGeom::PRESENT) { - m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.encodeIdx); + m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); m_rqt[nextDepth].cur.load(*nextContext); compressIntraCU(parentCTU, childGeom, zOrder); @@ -308,203 +321,173 @@ addSplitFlagCost(*splitPred, cuGeom.depth); else updateModeCost(*splitPred); + + checkDQPForSplitPred(splitPred->cu, cuGeom); checkBestMode(*splitPred, depth); } - checkDQP(md.bestMode->cu, cuGeom); - /* Copy best data to encData CTU and recon */ md.bestMode->cu.copyToPic(depth); if (md.bestMode != &md.pred[PRED_SPLIT]) - md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.encodeIdx); + md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx); } -bool Analysis::findJob(int threadId) +void Analysis::PMODE::processTasks(int workerThreadId) { - /* try to acquire a CU mode to analyze */ - m_pmodeLock.acquire(); - if (m_totalNumJobs > 
m_numAcquiredJobs) - { - int id = m_numAcquiredJobs++; - m_pmodeLock.release(); - - ProfileScopeEvent(pmode); - parallelModeAnalysis(threadId, id); - - m_pmodeLock.acquire(); - if (++m_numCompletedJobs == m_totalNumJobs) - m_modeCompletionEvent.trigger(); - m_pmodeLock.release(); - return true; - } - else - m_pmodeLock.release(); - - m_meLock.acquire(); - if (m_totalNumME > m_numAcquiredME) - { - int id = m_numAcquiredME++; - m_meLock.release(); - - ProfileScopeEvent(pme); - parallelME(threadId, id); - - m_meLock.acquire(); - if (++m_numCompletedME == m_totalNumME) - m_meCompletionEvent.trigger(); - m_meLock.release(); - return true; - } - else - m_meLock.release(); - - return false; +#if DETAILED_CU_STATS + int fe = master.m_modeDepth[cuGeom.depth].pred[PRED_2Nx2N].cu.m_encData->m_frameEncoderID; + master.m_stats[fe].countPModeTasks++; + ScopedElapsedTime pmodeTime(master.m_stats[fe].pmodeTime); +#endif + ProfileScopeEvent(pmode); + master.processPmode(*this, master.m_tld[workerThreadId].analysis); } -void Analysis::parallelME(int threadId, int meId) +/* process pmode jobs until none remain; may be called by the master thread or by + * a bonded peer (slave) thread via pmodeTasks() */ +void Analysis::processPmode(PMODE& pmode, Analysis& slave) { - Analysis* slave; - - if (threadId == -1) - slave = this; - else + /* acquire a mode task, else exit early */ + int task; + pmode.m_lock.acquire(); + if (pmode.m_jobTotal > pmode.m_jobAcquired) { - slave = &m_tld[threadId].analysis; - slave->setQP(*m_slice, m_rdCost.m_qp); - slave->m_slice = m_slice; - slave->m_frame = m_frame; - - slave->m_me.setSourcePU(*m_curInterMode->fencYuv, m_curInterMode->cu.m_cuAddr, m_curGeom->encodeIdx, m_puAbsPartIdx, m_puWidth, m_puHeight); - slave->prepMotionCompensation(m_curInterMode->cu, *m_curGeom, m_curPart);
x265_1.5.tar.gz/source/encoder/analysis.h -> x265_1.6.tar.gz/source/encoder/analysis.h
Changed
@@ -70,30 +70,43 @@ CUDataMemPool cuMemPool; }; + class PMODE : public BondedTaskGroup + { + public: + + Analysis& master; + const CUGeom& cuGeom; + int modes[MAX_PRED_TYPES]; + + PMODE(Analysis& m, const CUGeom& g) : master(m), cuGeom(g) {} + + void processTasks(int workerThreadId); + + protected: + + PMODE operator=(const PMODE&); + }; + + void processPmode(PMODE& pmode, Analysis& slave); + ModeDepth m_modeDepth[NUM_CU_DEPTH]; bool m_bTryLossless; bool m_bChromaSa8d; - /* Analysis data for load/save modes, keeps getting incremented as CTU analysis proceeds and data is consumed or read */ - analysis_intra_data* m_reuseIntraDataCTU; - analysis_inter_data* m_reuseInterDataCTU; - int32_t* reuseRef; Analysis(); + bool create(ThreadLocalData* tld); void destroy(); + Mode& compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext); protected: - /* mode analysis distribution */ - int m_totalNumJobs; - volatile int m_numAcquiredJobs; - volatile int m_numCompletedJobs; - Lock m_pmodeLock; - Event m_modeCompletionEvent; - bool findJob(int threadId); - void parallelModeAnalysis(int threadId, int jobId); - void parallelME(int threadId, int meId); + /* Analysis data for load/save modes, keeps getting incremented as CTU analysis proceeds and data is consumed or read */ + analysis_intra_data* m_reuseIntraDataCTU; + analysis_inter_data* m_reuseInterDataCTU; + int32_t* m_reuseRef; + uint32_t* m_reuseBestMergeCand; /* full analysis for an I-slice CU */ void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder); @@ -105,7 +118,7 @@ /* measure merge and skip */ void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom); - void checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom); + void checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom, bool isSkipMode); /* measure inter options */ void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize); @@ -119,9 
+132,6 @@ /* add the RD cost of coding a split flag (0 or 1) to the given mode */ void addSplitFlagCost(Mode& mode, uint32_t depth); - /* update CBF flags and QP values to be internally consistent */ - void checkDQP(CUData& cu, const CUGeom& cuGeom); - /* work-avoidance heuristics for RD levels < 5 */ uint32_t topSkipMinDepth(const CUData& parentCTU, const CUGeom& cuGeom); bool recursionDepthCheck(const CUData& parentCTU, const CUGeom& cuGeom, const Mode& bestMode); @@ -129,9 +139,13 @@ /* generate residual and recon pixels for an entire CTU recursively (RD0) */ void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom); + int calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom); + /* check whether current mode is the new best */ inline void checkBestMode(Mode& mode, uint32_t depth) { + X265_CHECK(mode.ok(), "mode costs are uninitialized\n"); + ModeDepth& md = m_modeDepth[depth]; if (md.bestMode) {
x265_1.5.tar.gz/source/encoder/api.cpp -> x265_1.6.tar.gz/source/encoder/api.cpp
Changed
@@ -173,6 +173,7 @@ { Encoder *encoder = static_cast<Encoder*>(enc); + encoder->stop(); encoder->printSummary(); encoder->destroy(); delete encoder; @@ -183,6 +184,8 @@ void x265_cleanup(void) { BitCost::destroy(); + CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */ + g_ctuSizeConfigured = 0; } extern "C" @@ -206,7 +209,7 @@ uint32_t numCUsInFrame = widthInCU * heightInCU; pic->analysisData.numCUsInFrame = numCUsInFrame; - pic->analysisData.numPartitions = NUM_CU_PARTITIONS; + pic->analysisData.numPartitions = NUM_4x4_PARTITIONS; } } @@ -215,3 +218,36 @@ { return x265_free(p); } + +static const x265_api libapi = +{ + &x265_param_alloc, + &x265_param_free, + &x265_param_default, + &x265_param_parse, + &x265_param_apply_profile, + &x265_param_default_preset, + &x265_picture_alloc, + &x265_picture_free, + &x265_picture_init, + &x265_encoder_open, + &x265_encoder_parameters, + &x265_encoder_headers, + &x265_encoder_encode, + &x265_encoder_get_stats, + &x265_encoder_log, + &x265_encoder_close, + &x265_cleanup, + x265_version_str, + x265_build_info_str, + x265_max_bit_depth, +}; + +extern "C" +const x265_api* x265_api_get(int bitDepth) +{ + if (bitDepth && bitDepth != X265_DEPTH) + return NULL; + + return &libapi; +}
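The last api.cpp hunk adds a table of function pointers and an accessor that hands it out only when the requested bit depth matches the build. The sketch below shows that pattern in isolation. Every name in it (DemoApi, demoApiGet, kBuildDepth) is a hypothetical stand-in, not an x265 declaration, and the table is cut down to two entries.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are NOT the x265 declarations.
struct DemoApi
{
    int         (*maxBitDepth)();
    const char* (*versionStr)();
};

static const int kBuildDepth = 8;   // plays the role of X265_DEPTH in the hunk

static int         demoMaxBitDepth() { return kBuildDepth; }
static const char* demoVersionStr()  { return "demo"; }

// One static table per build, filled with the entry points of that build.
static const DemoApi demoTable = { &demoMaxBitDepth, &demoVersionStr };

// Same shape as the accessor in the hunk: 0 means "whatever this build
// provides"; any other value must equal the compiled bit depth, otherwise
// the caller gets a null pointer and has to look elsewhere.
const DemoApi* demoApiGet(int bitDepth)
{
    assert(bitDepth >= 0);          // negative depths are a caller bug in this sketch
    if (bitDepth && bitDepth != kBuildDepth)
        return nullptr;
    return &demoTable;
}
```

A caller would request a depth, check the result for null, and then call through the table rather than through directly linked symbols.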
x265_1.5.tar.gz/source/encoder/dpb.cpp -> x265_1.6.tar.gz/source/encoder/dpb.cpp
Changed
@@ -104,11 +104,14 @@ if (type == X265_TYPE_B) { - // change from _R "referenced" to _N "non-referenced" NAL unit type + newFrame->m_encData->m_bHasReferences = false; + + // Adjust NAL type for unreferenced B frames (change from _R "referenced" + // to _N "non-referenced" NAL unit type) switch (slice->m_nalUnitType) { case NAL_UNIT_CODED_SLICE_TRAIL_R: - slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_TRAIL_N; + slice->m_nalUnitType = m_bTemporalSublayer ? NAL_UNIT_CODED_SLICE_TSA_N : NAL_UNIT_CODED_SLICE_TRAIL_N; break; case NAL_UNIT_CODED_SLICE_RADL_R: slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_RADL_N; @@ -120,10 +123,12 @@ break; } } - - /* m_bHasReferences starts out as true for non-B pictures, and is set to false - * once no more pictures reference it */ - newFrame->m_encData->m_bHasReferences = IS_REFERENCED(newFrame); + else + { + /* m_bHasReferences starts out as true for non-B pictures, and is set to false + * once no more pictures reference it */ + newFrame->m_encData->m_bHasReferences = true; + } m_picList.pushFront(*newFrame);
x265_1.5.tar.gz/source/encoder/dpb.h -> x265_1.6.tar.gz/source/encoder/dpb.h
Changed
@@ -39,10 +39,11 @@ int m_lastIDR; int m_pocCRA; - bool m_bRefreshPending; int m_maxRefL0; int m_maxRefL1; int m_bOpenGOP; + bool m_bRefreshPending; + bool m_bTemporalSublayer; PicList m_picList; PicList m_freeList; FrameData* m_picSymFreeList; @@ -56,6 +57,7 @@ m_maxRefL0 = param->maxNumReferences; m_maxRefL1 = param->bBPyramid ? 2 : 1; m_bOpenGOP = param->bOpenGOP; + m_bTemporalSublayer = !!param->bEnableTemporalSubLayers; } ~DPB();
x265_1.5.tar.gz/source/encoder/encoder.cpp -> x265_1.6.tar.gz/source/encoder/encoder.cpp
Changed
@@ -43,7 +43,7 @@ const char g_sliceTypeToChar[] = {'B', 'P', 'I'}; } -static const char *summaryCSVHeader = +static const char* summaryCSVHeader = "Command, Date/Time, Elapsed Time, FPS, Bitrate, " "Y PSNR, U PSNR, V PSNR, Global PSNR, SSIM, SSIM (dB), " "I count, I ave-QP, I kpbs, I-PSNR Y, I-PSNR U, I-PSNR V, I-SSIM (dB), " @@ -51,7 +51,7 @@ "B count, B ave-QP, B kpbs, B-PSNR Y, B-PSNR U, B-PSNR V, B-SSIM (dB), " "Version\n"; -const char* defaultAnalysisFileName = "x265_analysis.dat"; +static const char* defaultAnalysisFileName = "x265_analysis.dat"; using namespace x265; @@ -66,7 +66,6 @@ m_numLumaWPBiFrames = 0; m_numChromaWPBiFrames = 0; m_lookahead = NULL; - m_frameEncoder = NULL; m_rateControl = NULL; m_dpb = NULL; m_exportedPic = NULL; @@ -78,9 +77,12 @@ m_cuOffsetC = NULL; m_buOffsetY = NULL; m_buOffsetC = NULL; - m_threadPool = 0; - m_numThreadLocalData = 0; + m_threadPool = NULL; m_analysisFile = NULL; + for (int i = 0; i < X265_MAX_FRAME_THREADS; i++) + m_frameEncoder[i] = NULL; + + MotionEstimate::initScales(); } void Encoder::create() @@ -101,21 +103,35 @@ if (rows == 1 || cols < 3) p->bEnableWavefront = 0; - int poolThreadCount = p->poolNumThreads ? p->poolNumThreads : getCpuCount(); + bool allowPools = !p->numaPools || strcmp(p->numaPools, "none"); // Trim the thread pool if --wpp, --pme, and --pmode are disabled if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation) - poolThreadCount = 0; + allowPools = false; - if (poolThreadCount > 1) + if (!p->frameNumThreads) { - m_threadPool = ThreadPool::allocThreadPool(poolThreadCount); - poolThreadCount = m_threadPool->getThreadCount(); + // auto-detect frame threads + int cpuCount = ThreadPool::getCpuCount(); + if (!p->bEnableWavefront) + p->frameNumThreads = X265_MIN3(cpuCount, (rows + 1) / 2, X265_MAX_FRAME_THREADS); + else if (cpuCount >= 32) + p->frameNumThreads = (p->sourceHeight > 2000) ? 
8 : 6; // dual-socket 10-core IvyBridge or higher + else if (cpuCount >= 16) + p->frameNumThreads = 5; // 8 HT cores, or dual socket + else if (cpuCount >= 8) + p->frameNumThreads = 3; // 4 HT cores + else if (cpuCount >= 4) + p->frameNumThreads = 2; // Dual or Quad core + else + p->frameNumThreads = 1; } - else - poolThreadCount = 0; - if (!poolThreadCount) + m_numPools = 0; + if (allowPools) + m_threadPool = ThreadPool::allocThreadPools(p, m_numPools); + + if (!m_numPools) { // issue warnings if any of these features were requested if (p->bEnableWavefront) @@ -129,31 +145,40 @@ p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = 0; } - if (!p->frameNumThreads) - { - // auto-detect frame threads - int cpuCount = getCpuCount(); - if (!p->bEnableWavefront) - p->frameNumThreads = X265_MIN(cpuCount, (rows + 1) / 2); - else if (cpuCount >= 32) - p->frameNumThreads = (p->sourceHeight > 2000) ? 8 : 6; // dual-socket 10-core IvyBridge or higher - else if (cpuCount >= 16) - p->frameNumThreads = 5; // 8 HT cores, or dual socket - else if (cpuCount >= 8) - p->frameNumThreads = 3; // 4 HT cores - else if (cpuCount >= 4) - p->frameNumThreads = 2; // Dual or Quad core - else - p->frameNumThreads = 1; - } + char buf[128]; + int len = 0; + if (p->bEnableWavefront) + len += sprintf(buf + len, "wpp(%d rows)", rows); + if (p->bDistributeModeAnalysis) + len += sprintf(buf + len, "%spmode", len ? "+" : ""); + if (p->bDistributeMotionEstimation) + len += sprintf(buf + len, "%spme ", len ? "+" : ""); + if (!len) + strcpy(buf, "none"); - x265_log(p, X265_LOG_INFO, "WPP streams / frame threads / pool : %d / %d / %d%s%s\n", - p->bEnableWavefront ? rows : 0, p->frameNumThreads, poolThreadCount, - p->bDistributeMotionEstimation ? " / pme" : "", p->bDistributeModeAnalysis ? 
" / pmode" : ""); + x265_log(p, X265_LOG_INFO, "frame threads / pool features : %d / %s\n", p->frameNumThreads, buf); - m_frameEncoder = new FrameEncoder[m_param->frameNumThreads]; for (int i = 0; i < m_param->frameNumThreads; i++) - m_frameEncoder[i].setThreadPool(m_threadPool); + m_frameEncoder[i] = new FrameEncoder; + + if (m_numPools) + { + for (int i = 0; i < m_param->frameNumThreads; i++) + { + int pool = i % m_numPools; + m_frameEncoder[i]->m_pool = &m_threadPool[pool]; + m_frameEncoder[i]->m_jpId = m_threadPool[pool].m_numProviders++; + m_threadPool[pool].m_jpTable[m_frameEncoder[i]->m_jpId] = m_frameEncoder[i]; + } + for (int i = 0; i < m_numPools; i++) + m_threadPool[i].start(); + } + else + { + /* CU stats and noise-reduction buffers are indexed by jpId, so it cannot be left as -1 */ + for (int i = 0; i < m_param->frameNumThreads; i++) + m_frameEncoder[i]->m_jpId = 0; + } if (!m_scalingList.init()) { @@ -168,27 +193,17 @@ m_aborted = true; m_scalingList.setupQuantMatrices(); - /* Allocate thread local data, one for each thread pool worker and - * if --no-wpp, one for each frame encoder */ - m_numThreadLocalData = poolThreadCount; - if (!m_param->bEnableWavefront) - m_numThreadLocalData += m_param->frameNumThreads; - m_threadLocalData = new ThreadLocalData[m_numThreadLocalData]; - for (int i = 0; i < m_numThreadLocalData; i++) + m_lookahead = new Lookahead(m_param, m_threadPool); + if (m_numPools) { - m_threadLocalData[i].analysis.setThreadPool(m_threadPool); - m_threadLocalData[i].analysis.initSearch(*m_param, m_scalingList); - m_threadLocalData[i].analysis.create(m_threadLocalData); + m_lookahead->m_jpId = m_threadPool[0].m_numProviders++; + m_threadPool[0].m_jpTable[m_lookahead->m_jpId] = m_lookahead; } - if (!m_param->bEnableWavefront) - for (int i = 0; i < m_param->frameNumThreads; i++) - m_frameEncoder[i].m_tld = &m_threadLocalData[poolThreadCount + i]; - - m_lookahead = new Lookahead(m_param, m_threadPool); m_dpb = new DPB(m_param); - m_rateControl 
= new RateControl(m_param); + m_rateControl = new RateControl(*m_param); + initVPS(&m_vps); initSPS(&m_sps); initPPS(&m_pps); @@ -229,26 +244,29 @@ } } - if (m_frameEncoder) + int numRows = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize; + int numCols = (m_param->sourceWidth + g_maxCUSize - 1) / g_maxCUSize; + for (int i = 0; i < m_param->frameNumThreads; i++) { - int numRows = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize; - int numCols = (m_param->sourceWidth + g_maxCUSize - 1) / g_maxCUSize; - for (int i = 0; i < m_param->frameNumThreads; i++) + if (!m_frameEncoder[i]->init(this, numRows, numCols)) { - if (!m_frameEncoder[i].init(this, numRows, numCols, i)) - { - x265_log(m_param, X265_LOG_ERROR, "Unable to initialize frame encoder, aborting\n"); - m_aborted = true;
x265_1.5.tar.gz/source/encoder/encoder.h -> x265_1.6.tar.gz/source/encoder/encoder.h
Changed
@@ -70,7 +70,6 @@ class Lookahead; class RateControl; class ThreadPool; -struct ThreadLocalData; class Encoder : public x265_encoder { @@ -86,11 +85,12 @@ int64_t m_prevReorderedPts[2]; ThreadPool* m_threadPool; - FrameEncoder* m_frameEncoder; + FrameEncoder* m_frameEncoder[X265_MAX_FRAME_THREADS]; DPB* m_dpb; Frame* m_exportedPic; + int m_numPools; int m_curEncoder; /* cached PicYuv offset arrays, shared by all instances of @@ -120,14 +120,12 @@ PPS m_pps; NALList m_nalList; ScalingList m_scalingList; // quantization matrix information - int m_numThreadLocalData; int m_lastBPSEI; uint32_t m_numDelayedPic; x265_param* m_param; RateControl* m_rateControl; - ThreadLocalData* m_threadLocalData; Lookahead* m_lookahead; Window m_conformanceWindow; @@ -138,6 +136,7 @@ ~Encoder() {} void create(); + void stop(); void destroy(); int encode(const x265_picture* pic, x265_picture *pic_out); @@ -154,8 +153,6 @@ char* statsCSVString(EncStats& stat, char* buffer); - void setThreadPool(ThreadPool* p) { m_threadPool = p; } - void configure(x265_param *param); void updateVbvPlan(RateControl* rc); @@ -172,6 +169,7 @@ protected: + void initVPS(VPS *vps); void initSPS(SPS *sps); void initPPS(PPS *pps); };
x265_1.5.tar.gz/source/encoder/entropy.cpp -> x265_1.6.tar.gz/source/encoder/entropy.cpp
Changed
@@ -43,6 +43,7 @@ { markValid(); m_fracBits = 0; + m_pad = 0; X265_CHECK(sizeof(m_contextState) >= sizeof(m_contextState[0]) * MAX_OFF_CTX_MOD, "context state table is too small\n"); } @@ -51,17 +52,21 @@ WRITE_CODE(0, 4, "vps_video_parameter_set_id"); WRITE_CODE(3, 2, "vps_reserved_three_2bits"); WRITE_CODE(0, 6, "vps_reserved_zero_6bits"); - WRITE_CODE(0, 3, "vps_max_sub_layers_minus1"); - WRITE_FLAG(1, "vps_temporal_id_nesting_flag"); + WRITE_CODE(vps.maxTempSubLayers - 1, 3, "vps_max_sub_layers_minus1"); + WRITE_FLAG(vps.maxTempSubLayers == 1, "vps_temporal_id_nesting_flag"); WRITE_CODE(0xffff, 16, "vps_reserved_ffff_16bits"); - codeProfileTier(vps.ptl); + codeProfileTier(vps.ptl, vps.maxTempSubLayers); WRITE_FLAG(true, "vps_sub_layer_ordering_info_present_flag"); - WRITE_UVLC(vps.maxDecPicBuffering - 1, "vps_max_dec_pic_buffering_minus1[i]"); - WRITE_UVLC(vps.numReorderPics, "vps_num_reorder_pics[i]"); - WRITE_UVLC(0, "vps_max_latency_increase_plus1[i]"); + for (uint32_t i = 0; i < vps.maxTempSubLayers; i++) + { + WRITE_UVLC(vps.maxDecPicBuffering - 1, "vps_max_dec_pic_buffering_minus1[i]"); + WRITE_UVLC(vps.numReorderPics, "vps_num_reorder_pics[i]"); + WRITE_UVLC(vps.maxLatencyIncrease + 1, "vps_max_latency_increase_plus1[i]"); + } + WRITE_CODE(0, 6, "vps_max_nuh_reserved_zero_layer_id"); WRITE_UVLC(0, "vps_max_op_sets_minus1"); WRITE_FLAG(0, "vps_timing_info_present_flag"); /* we signal timing info in SPS-VUI */ @@ -71,16 +76,16 @@ void Entropy::codeSPS(const SPS& sps, const ScalingList& scalingList, const ProfileTierLevel& ptl) { WRITE_CODE(0, 4, "sps_video_parameter_set_id"); - WRITE_CODE(0, 3, "sps_max_sub_layers_minus1"); - WRITE_FLAG(1, "sps_temporal_id_nesting_flag"); + WRITE_CODE(sps.maxTempSubLayers - 1, 3, "sps_max_sub_layers_minus1"); + WRITE_FLAG(sps.maxTempSubLayers == 1, "sps_temporal_id_nesting_flag"); - codeProfileTier(ptl); + codeProfileTier(ptl, sps.maxTempSubLayers); WRITE_UVLC(0, "sps_seq_parameter_set_id"); WRITE_UVLC(sps.chromaFormatIdc, 
"chroma_format_idc"); if (sps.chromaFormatIdc == X265_CSP_I444) - WRITE_FLAG(0, "separate_colour_plane_flag"); + WRITE_FLAG(0, "separate_colour_plane_flag"); WRITE_UVLC(sps.picWidthInLumaSamples, "pic_width_in_luma_samples"); WRITE_UVLC(sps.picHeightInLumaSamples, "pic_height_in_luma_samples"); @@ -101,9 +106,12 @@ WRITE_UVLC(BITS_FOR_POC - 4, "log2_max_pic_order_cnt_lsb_minus4"); WRITE_FLAG(true, "sps_sub_layer_ordering_info_present_flag"); - WRITE_UVLC(sps.maxDecPicBuffering - 1, "sps_max_dec_pic_buffering_minus1[i]"); - WRITE_UVLC(sps.numReorderPics, "sps_num_reorder_pics[i]"); - WRITE_UVLC(sps.maxLatencyIncrease + 1, "sps_max_latency_increase_plus1[i]"); + for (uint32_t i = 0; i < sps.maxTempSubLayers; i++) + { + WRITE_UVLC(sps.maxDecPicBuffering - 1, "sps_max_dec_pic_buffering_minus1[i]"); + WRITE_UVLC(sps.numReorderPics, "sps_num_reorder_pics[i]"); + WRITE_UVLC(sps.maxLatencyIncrease + 1, "sps_max_latency_increase_plus1[i]"); + } WRITE_UVLC(sps.log2MinCodingBlockSize - 3, "log2_min_coding_block_size_minus3"); WRITE_UVLC(sps.log2DiffMaxMinCodingBlockSize, "log2_diff_max_min_coding_block_size"); @@ -129,7 +137,7 @@ WRITE_FLAG(sps.bUseStrongIntraSmoothing, "sps_strong_intra_smoothing_enable_flag"); WRITE_FLAG(1, "vui_parameters_present_flag"); - codeVUI(sps.vuiParameters); + codeVUI(sps.vuiParameters, sps.maxTempSubLayers); WRITE_FLAG(0, "sps_extension_flag"); } @@ -184,7 +192,7 @@ WRITE_FLAG(0, "pps_extension_flag"); } -void Entropy::codeProfileTier(const ProfileTierLevel& ptl) +void Entropy::codeProfileTier(const ProfileTierLevel& ptl, int maxTempSubLayers) { WRITE_CODE(0, 2, "XXX_profile_space[]"); WRITE_FLAG(ptl.tierFlag, "XXX_tier_flag[]"); @@ -222,9 +230,17 @@ } WRITE_CODE(ptl.levelIdc, 8, "general_level_idc"); + + if (maxTempSubLayers > 1) + { + WRITE_FLAG(0, "sub_layer_profile_present_flag[i]"); + WRITE_FLAG(0, "sub_layer_level_present_flag[i]"); + for (int i = maxTempSubLayers - 1; i < 8 ; i++) + WRITE_CODE(0, 2, "reserved_zero_2bits"); + } } -void 
Entropy::codeVUI(const VUI& vui) +void Entropy::codeVUI(const VUI& vui, int maxSubTLayers) { WRITE_FLAG(vui.aspectRatioInfoPresentFlag, "aspect_ratio_info_present_flag"); if (vui.aspectRatioInfoPresentFlag) @@ -282,7 +298,7 @@ WRITE_FLAG(vui.hrdParametersPresentFlag, "vui_hrd_parameters_present_flag"); if (vui.hrdParametersPresentFlag) - codeHrdParameters(vui.hrdParameters); + codeHrdParameters(vui.hrdParameters, maxSubTLayers); WRITE_FLAG(0, "bitstream_restriction_flag"); } @@ -329,7 +345,7 @@ } } -void Entropy::codeHrdParameters(const HRDInfo& hrd) +void Entropy::codeHrdParameters(const HRDInfo& hrd, int maxSubTLayers) { WRITE_FLAG(1, "nal_hrd_parameters_present_flag"); WRITE_FLAG(0, "vcl_hrd_parameters_present_flag"); @@ -342,13 +358,16 @@ WRITE_CODE(hrd.cpbRemovalDelayLength - 1, 5, "au_cpb_removal_delay_length_minus1"); WRITE_CODE(hrd.dpbOutputDelayLength - 1, 5, "dpb_output_delay_length_minus1"); - WRITE_FLAG(1, "fixed_pic_rate_general_flag"); - WRITE_UVLC(0, "elemental_duration_in_tc_minus1"); - WRITE_UVLC(0, "cpb_cnt_minus1"); + for (int i = 0; i < maxSubTLayers; i++) + { + WRITE_FLAG(1, "fixed_pic_rate_general_flag"); + WRITE_UVLC(0, "elemental_duration_in_tc_minus1"); + WRITE_UVLC(0, "cpb_cnt_minus1"); - WRITE_UVLC(hrd.bitRateValue - 1, "bit_rate_value_minus1"); - WRITE_UVLC(hrd.cpbSizeValue - 1, "cpb_size_value_minus1"); - WRITE_FLAG(hrd.cbrFlag, "cbr_flag"); + WRITE_UVLC(hrd.bitRateValue - 1, "bit_rate_value_minus1"); + WRITE_UVLC(hrd.cpbSizeValue - 1, "cpb_size_value_minus1"); + WRITE_FLAG(hrd.cbrFlag, "cbr_flag"); + } } void Entropy::codeAUD(const Slice& slice) @@ -521,15 +540,14 @@ { const Slice* slice = ctu.m_slice; - if (depth <= slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP) - bEncodeDQP = true; - int cuSplitFlag = !(cuGeom.flags & CUGeom::LEAF); int cuUnsplitFlag = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY); if (!cuUnsplitFlag) { uint32_t qNumParts = cuGeom.numPartitions >> 2; + if (depth == slice->m_pps->maxCuDQPDepth && 
slice->m_pps->bUseDQP) + bEncodeDQP = true; for (uint32_t qIdx = 0; qIdx < 4; ++qIdx, absPartIdx += qNumParts) { const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + qIdx); @@ -539,13 +557,14 @@ return; } - // We need to split, so don't try these modes. if (cuSplitFlag) codeSplitFlag(ctu, absPartIdx, depth); if (depth < ctu.m_cuDepth[absPartIdx] && depth < g_maxCUDepth) { uint32_t qNumParts = cuGeom.numPartitions >> 2; + if (depth == slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP) + bEncodeDQP = true; for (uint32_t qIdx = 0; qIdx < 4; ++qIdx, absPartIdx += qNumParts) { const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + qIdx); @@ -554,6 +573,9 @@ return; } + if (depth <= slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP) + bEncodeDQP = true; + if (slice->m_pps->bTransquantBypassEnabled) codeCUTransquantBypassFlag(ctu.m_tqBypass[absPartIdx]); @@ -654,7 +676,7 @@ { // Encode slice finish bool bTerminateSlice = false; - if (cuAddr + (NUM_CU_PARTITIONS >> (depth << 1)) == realEndAddress)
x265_1.5.tar.gz/source/encoder/entropy.h -> x265_1.6.tar.gz/source/encoder/entropy.h
Changed
@@ -142,9 +142,9 @@ void codeVPS(const VPS& vps); void codeSPS(const SPS& sps, const ScalingList& scalingList, const ProfileTierLevel& ptl); void codePPS(const PPS& pps); - void codeVUI(const VUI& vui); + void codeVUI(const VUI& vui, int maxSubTLayers); void codeAUD(const Slice& slice); - void codeHrdParameters(const HRDInfo& hrd); + void codeHrdParameters(const HRDInfo& hrd, int maxSubTLayers); void codeSliceHeader(const Slice& slice, FrameData& encData); void codeSliceHeaderWPPEntryPoints(const Slice& slice, const uint32_t *substreamSizes, uint32_t maxOffset); @@ -230,7 +230,7 @@ void writeEpExGolomb(uint32_t symbol, uint32_t count); void writeCoefRemainExGolomb(uint32_t symbol, const uint32_t absGoRice); - void codeProfileTier(const ProfileTierLevel& ptl); + void codeProfileTier(const ProfileTierLevel& ptl, int maxTempSubLayers); void codeScalingList(const ScalingList&); void codeScalingList(const ScalingList& scalingList, uint32_t sizeId, uint32_t listId);
x265_1.5.tar.gz/source/encoder/frameencoder.cpp -> x265_1.6.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -39,14 +39,13 @@ void weightAnalyse(Slice& slice, Frame& frame, x265_param& param); FrameEncoder::FrameEncoder() - : WaveFront(NULL) - , m_threadActive(true) { m_prevOutputTime = x265_mdate(); - m_totalWorkerElapsedTime = 0; + m_isFrameEncoder = true; + m_threadActive = true; m_slicetypeWaitTime = 0; - m_frameEncoderID = 0; m_activeWorkerCount = 0; + m_completionCount = 0; m_bAllRowsStop = false; m_vbvResetTriggerRow = -1; m_outStreams = NULL; @@ -59,6 +58,7 @@ m_frame = NULL; m_cuGeoms = NULL; m_ctuGeomMap = NULL; + m_localTldIdx = 0; memset(&m_frameStats, 0, sizeof(m_frameStats)); memset(&m_rce, 0, sizeof(RateControlEntry)); } @@ -66,10 +66,22 @@ void FrameEncoder::destroy() { if (m_pool) - JobProvider::flush(); // ensure no worker threads are using this frame - - m_threadActive = false; - m_enable.trigger(); + { + if (!m_jpId) + { + int numTLD = m_pool->m_numWorkers; + if (!m_param->bEnableWavefront) + numTLD += m_pool->m_numProviders; + for (int i = 0; i < numTLD; i++) + m_tld[i].destroy(); + delete [] m_tld; + } + } + else + { + m_tld->destroy(); + delete m_tld; + } delete[] m_rows; delete[] m_outStreams; @@ -85,12 +97,9 @@ delete m_rce.picTimingSEI; delete m_rce.hrdTiming; } - - // wait for worker thread to exit - stop(); } -bool FrameEncoder::init(Encoder *top, int numRows, int numCols, int id) +bool FrameEncoder::init(Encoder *top, int numRows, int numCols) { m_top = top; m_param = top->m_param; @@ -99,14 +108,14 @@ m_filterRowDelay = (m_param->bEnableSAO && m_param->bSaoNonDeblocked) ? 2 : (m_param->bEnableSAO || m_param->bEnableLoopFilter ? 
1 : 0); m_filterRowDelayCus = m_filterRowDelay * numCols; - m_frameEncoderID = id; m_rows = new CTURow[m_numRows]; bool ok = !!m_numRows; - int range = m_param->searchRange; /* fpel search */ - range += 1; /* diamond search range check lag */ - range += 2; /* subpel refine */ - range += NTAPS_LUMA / 2; /* subpel filter half-length */ + /* determine full motion search range */ + int range = m_param->searchRange; /* fpel search */ + range += !!(m_param->searchMethod < 2); /* diamond/hex range check lag */ + range += NTAPS_LUMA / 2; /* subpel filter half-length */ + range += 2 + MotionEstimate::hpelIterationCount(m_param->subpelRefine) / 2; /* subpel refine steps */ m_refLagRows = 1 + ((range + g_maxCUSize - 1) / g_maxCUSize); // NOTE: 2 times of numRows because both Encoder and Filter in same queue @@ -134,7 +143,6 @@ else m_param->noiseReductionIntra = m_param->noiseReductionInter = 0; - start(); return ok; } @@ -143,6 +151,7 @@ { /* Geoms only vary between CTUs in the presence of picture edges */ int maxCUSize = m_param->maxCUSize; + int minCUSize = m_param->minCUSize; int heightRem = m_param->sourceHeight & (maxCUSize - 1); int widthRem = m_param->sourceWidth & (maxCUSize - 1); int allocGeoms = 1; // body @@ -157,7 +166,7 @@ return false; // body - CUData::calcCTUGeoms(maxCUSize, maxCUSize, maxCUSize, m_cuGeoms); + CUData::calcCTUGeoms(maxCUSize, maxCUSize, maxCUSize, minCUSize, m_cuGeoms); memset(m_ctuGeomMap, 0, sizeof(uint32_t) * m_numRows * m_numCols); if (allocGeoms == 1) return true; @@ -166,7 +175,7 @@ if (widthRem) { // right - CUData::calcCTUGeoms(widthRem, maxCUSize, maxCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS); + CUData::calcCTUGeoms(widthRem, maxCUSize, maxCUSize, minCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS); for (uint32_t i = 0; i < m_numRows; i++) { uint32_t ctuAddr = m_numCols * (i + 1) - 1; @@ -177,7 +186,7 @@ if (heightRem) { // bottom - CUData::calcCTUGeoms(maxCUSize, heightRem, maxCUSize, m_cuGeoms + countGeoms * 
CUGeom::MAX_GEOMS); + CUData::calcCTUGeoms(maxCUSize, heightRem, maxCUSize, minCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS); for (uint32_t i = 0; i < m_numCols; i++) { uint32_t ctuAddr = m_numCols * (m_numRows - 1) + i; @@ -188,7 +197,7 @@ if (widthRem) { // corner - CUData::calcCTUGeoms(widthRem, heightRem, maxCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS); + CUData::calcCTUGeoms(widthRem, heightRem, maxCUSize, minCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS); uint32_t ctuAddr = m_numCols * m_numRows - 1; m_ctuGeomMap[ctuAddr] = countGeoms * CUGeom::MAX_GEOMS; @@ -204,7 +213,9 @@ { m_slicetypeWaitTime = x265_mdate() - m_prevOutputTime; m_frame = curFrame; - curFrame->m_encData->m_frameEncoderID = m_frameEncoderID; // Each Frame knows the ID of the FrameEncoder encoding it + m_sliceType = curFrame->m_lowres.sliceType; + curFrame->m_encData->m_frameEncoderID = m_jpId; + curFrame->m_encData->m_jobProvider = this; curFrame->m_encData->m_slice->m_mref = m_mref; if (!m_cuGeoms) @@ -219,19 +230,66 @@ void FrameEncoder::threadMain() { - THREAD_NAME("Frame", m_frameEncoderID); + THREAD_NAME("Frame", m_jpId); - // worker thread routine for FrameEncoder - do + if (m_pool) { - m_enable.wait(); // Encoder::encode() triggers this event - if (m_threadActive) + m_pool->setCurrentThreadAffinity(); + + /* the first FE on each NUMA node is responsible for allocating thread + * local data for all worker threads in that pool. 
If WPP is disabled, then + * each FE also needs a TLD instance */ + if (!m_jpId) { - compressFrame(); - m_done.trigger(); // FrameEncoder::getEncodedPicture() blocks for this event + int numTLD = m_pool->m_numWorkers; + if (!m_param->bEnableWavefront) + numTLD += m_pool->m_numProviders; + + m_tld = new ThreadLocalData[numTLD]; + for (int i = 0; i < numTLD; i++) + { + m_tld[i].analysis.initSearch(*m_param, m_top->m_scalingList); + m_tld[i].analysis.create(m_tld); + } + + for (int i = 0; i < m_pool->m_numProviders; i++) + { + if (m_pool->m_jpTable[i]->m_isFrameEncoder) /* ugh; over-allocation and other issues here */ + { + FrameEncoder *peer = dynamic_cast<FrameEncoder*>(m_pool->m_jpTable[i]); + peer->m_tld = m_tld; + } + } } + + if (m_param->bEnableWavefront) + m_localTldIdx = -1; // cause exception if used + else + m_localTldIdx = m_pool->m_numWorkers + m_jpId; + } + else + {
x265_1.5.tar.gz/source/encoder/frameencoder.h -> x265_1.6.tar.gz/source/encoder/frameencoder.h
Changed
@@ -122,7 +122,7 @@ virtual ~FrameEncoder() {} - bool init(Encoder *top, int numRows, int numCols, int id); + virtual bool init(Encoder *top, int numRows, int numCols); void destroy(); @@ -135,8 +135,12 @@ Event m_enable; Event m_done; Event m_completionEvent; - bool m_threadActive; - int m_frameEncoderID; + int m_localTldIdx; + + volatile bool m_threadActive; + volatile bool m_bAllRowsStop; + volatile int m_completionCount; + volatile int m_vbvResetTriggerRow; uint32_t m_numRows; uint32_t m_numCols; @@ -144,9 +148,6 @@ uint32_t m_filterRowDelayCus; uint32_t m_refLagRows; - volatile bool m_bAllRowsStop; - volatile int m_vbvResetTriggerRow; - CTURow* m_rows; RateControlEntry m_rce; SEIDecodedPictureHash m_seiReconPictureDigest; @@ -177,6 +178,9 @@ int64_t m_slicetypeWaitTime; // total elapsed time waiting for decided frame int64_t m_totalWorkerElapsedTime; // total elapsed time spent by worker threads processing CTUs int64_t m_totalNoWorkerTime; // total elapsed time without any active worker threads +#if DETAILED_CU_STATS + CUStats m_cuStats; +#endif Encoder* m_top; x265_param* m_param; @@ -196,6 +200,21 @@ FrameFilter m_frameFilter; NALList m_nalList; + class WeightAnalysis : public BondedTaskGroup + { + public: + + FrameEncoder& master; + + WeightAnalysis(FrameEncoder& fe) : master(fe) {} + + void processTasks(int workerThreadId); + + protected: + + WeightAnalysis operator=(const WeightAnalysis&); + }; + protected: bool initializeGeoms(); @@ -203,9 +222,6 @@ /* analyze / compress frame, can be run in parallel within reference constraints */ void compressFrame(); - /* called by compressFrame to perform wave-front compression analysis */ - void compressCTURows(); - /* called by compressFrame to generate final per-row bitstreams */ void encodeSlice(); @@ -215,8 +231,8 @@ void noiseReductionUpdate(); /* Called by WaveFront::findJob() */ - void processRow(int row, int threadId); - void processRowEncoder(int row, ThreadLocalData& tld); + virtual void processRow(int 
row, int threadId); + virtual void processRowEncoder(int row, ThreadLocalData& tld); void enqueueRowEncoder(int row) { WaveFront::enqueueRow(row * 2 + 0); } void enqueueRowFilter(int row) { WaveFront::enqueueRow(row * 2 + 1); }
x265_1.5.tar.gz/source/encoder/framefilter.cpp -> x265_1.6.tar.gz/source/encoder/framefilter.cpp
Changed
@@ -83,6 +83,11 @@ { ProfileScopeEvent(filterCTURow); +#if DETAILED_CU_STATS + ScopedElapsedTime filterPerfScope(m_frameEncoder->m_cuStats.loopFilterElapsedTime); + m_frameEncoder->m_cuStats.countLoopFilter++; +#endif + if (!m_param->bEnableLoopFilter && !m_param->bEnableSAO) { processRowPost(row); @@ -298,6 +303,9 @@ updateChecksum(reconPic->m_picOrg[1], m_frameEncoder->m_checksum[1], height, width, stride, row, cuHeight); updateChecksum(reconPic->m_picOrg[2], m_frameEncoder->m_checksum[2], height, width, stride, row, cuHeight); } + + if (ATOMIC_INC(&m_frameEncoder->m_completionCount) == 2 * (int)m_frameEncoder->m_numRows) + m_frameEncoder->m_completionEvent.trigger(); } static uint64_t computeSSD(pixel *fenc, pixel *rec, intptr_t stride, uint32_t width, uint32_t height) @@ -421,7 +429,7 @@ /* Original YUV restoration for CU in lossless coding */ static void origCUSampleRestoration(const CUData* cu, const CUGeom& cuGeom, Frame& frame) { - uint32_t absPartIdx = cuGeom.encodeIdx; + uint32_t absPartIdx = cuGeom.absPartIdx; if (cu->m_cuDepth[absPartIdx] > cuGeom.depth) { for (int subPartIdx = 0; subPartIdx < 4; subPartIdx++)
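The `m_completionCount` check added in the hunk above can be read as a "last job out signals completion" pattern. The sketch below is an illustration only, not x265 code. It assumes that `ATOMIC_INC` returns the post-increment value and that every row contributes two jobs (the `row * 2 + 0` encoder job and the `row * 2 + 1` filter job seen in frameencoder.h), which is why the target is `2 * numRows`. `CompletionTracker`, `jobDone` and `finishAllJobs` are hypothetical names.

```cpp
#include <atomic>

// Hypothetical stand-in for the frame encoder's completion bookkeeping.
struct CompletionTracker
{
    std::atomic<int> completionCount{0};
    int eventsTriggered = 0;   // stands in for m_completionEvent.trigger()
    int numRows;

    explicit CompletionTracker(int rows) : numRows(rows) {}

    // Call once per finished row job.
    void jobDone()
    {
        // fetch_add returns the previous value; +1 gives the post-increment count
        int count = completionCount.fetch_add(1) + 1;
        if (count == 2 * numRows)
            eventsTriggered++;   // only the job that reaches the target gets here
    }
};

// Finish every job for a frame with the given row count and report how
// many times the completion event fired.
int finishAllJobs(int rows)
{
    CompletionTracker tracker(rows);
    for (int job = 0; job < 2 * rows; job++)
        tracker.jobDone();
    return tracker.eventsTriggered;
}
```

Because the comparison uses the value returned by the atomic increment, exactly one caller can observe the target count, so the event should fire once per frame regardless of which job finishes last.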
x265_1.5.tar.gz/source/encoder/level.cpp -> x265_1.6.tar.gz/source/encoder/level.cpp
Changed
@@ -60,6 +60,7 @@ /* determine minimum decoder level required to decode the described video */ void determineLevel(const x265_param &param, VPS& vps) { + vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1; if (param.bLossless) vps.ptl.profileIdc = Profile::NONE; else if (param.internalCsp == X265_CSP_I420) @@ -154,15 +155,25 @@ return; } - vps.ptl.levelIdc = levels[i].levelEnum; - vps.ptl.minCrForLevel = levels[i].minCompressionRatio; - vps.ptl.maxLumaSrForLevel = levels[i].maxLumaSamplesPerSecond; +#define CHECK_RANGE(value, main, high) (value > main && value <= high) - if (bitrate > levels[i].maxBitrateMain && bitrate <= levels[i].maxBitrateHigh && + if (CHECK_RANGE(bitrate, levels[i].maxBitrateMain, levels[i].maxBitrateHigh) && + CHECK_RANGE((uint32_t)param.rc.vbvBufferSize, levels[i].maxCpbSizeMain, levels[i].maxCpbSizeHigh) && levels[i].maxBitrateHigh != MAX_UINT) - vps.ptl.tierFlag = Level::HIGH; + { + /* If the user has not enabled high tier, continue looking to see if we can encode at a higher level, main tier */ + if (!param.bHighTier && (levels[i].levelIdc < param.levelIdc)) + continue; + else + vps.ptl.tierFlag = Level::HIGH; + } else vps.ptl.tierFlag = Level::MAIN; +#undef CHECK_RANGE + + vps.ptl.levelIdc = levels[i].levelEnum; + vps.ptl.minCrForLevel = levels[i].minCompressionRatio; + vps.ptl.maxLumaSrForLevel = levels[i].maxLumaSamplesPerSecond; break; } @@ -250,7 +261,7 @@ } if ((uint32_t)param.rc.vbvBufferSize > (highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain)) { - param.rc.vbvMaxBitrate = highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain; + param.rc.vbvBufferSize = highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain; x265_log(&param, X265_LOG_INFO, "lowering VBV buffer size to %dKb\n", param.rc.vbvBufferSize); }
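The tier decision in the hunk above now requires both the bitrate and the VBV buffer size to fall strictly above the main-tier limit and at or below the high-tier limit. A minimal sketch of that range test follows. It deliberately omits the `MAX_UINT` guard and the `bHighTier` / `continue` handling from the real code. `LevelLimits`, `exampleLimits` and `pickTier` are hypothetical names, and the limit values are placeholders rather than figures from the HEVC level tables. Unlike the macro in the diff, the version here parenthesizes its arguments.

```cpp
#include <cstdint>

// True when value exceeds the main-tier limit but still fits the high-tier limit.
#define CHECK_RANGE(value, main, high) ((value) > (main) && (value) <= (high))

enum Tier { TIER_MAIN, TIER_HIGH };

// Hypothetical per-level limits; the field names mirror the diff.
struct LevelLimits
{
    uint32_t maxBitrateMain, maxBitrateHigh;
    uint32_t maxCpbSizeMain, maxCpbSizeHigh;
};

// Placeholder limits used only for illustration.
LevelLimits exampleLimits()
{
    LevelLimits l = { 12000, 30000, 12000, 30000 };
    return l;
}

// High tier is chosen only when BOTH quantities need it and both still fit.
Tier pickTier(const LevelLimits& l, uint32_t bitrate, uint32_t vbvBufferSize)
{
    if (CHECK_RANGE(bitrate, l.maxBitrateMain, l.maxBitrateHigh) &&
        CHECK_RANGE(vbvBufferSize, l.maxCpbSizeMain, l.maxCpbSizeHigh))
        return TIER_HIGH;
    return TIER_MAIN;
}
```

With these placeholder limits, a 20000 kbps stream with a 20000 kb buffer selects the high tier, while the same bitrate with a 10000 kb buffer stays in the main tier, which is the behavioural difference the added CPB check introduces.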
x265_1.5.tar.gz/source/encoder/motion.cpp -> x265_1.6.tar.gz/source/encoder/motion.cpp
Changed
@@ -59,38 +59,6 @@ int sizeScale[NUM_PU_SIZES]; #define SAD_THRESH(v) (bcost < (((v >> 4) * sizeScale[partEnum]))) -void initScales(void) -{ -#define SETUP_SCALE(W, H) \ - sizeScale[LUMA_ ## W ## x ## H] = (H * H) >> 4; - SETUP_SCALE(4, 4); - SETUP_SCALE(8, 8); - SETUP_SCALE(8, 4); - SETUP_SCALE(4, 8); - SETUP_SCALE(16, 16); - SETUP_SCALE(16, 8); - SETUP_SCALE(8, 16); - SETUP_SCALE(16, 12); - SETUP_SCALE(12, 16); - SETUP_SCALE(4, 16); - SETUP_SCALE(16, 4); - SETUP_SCALE(32, 32); - SETUP_SCALE(32, 16); - SETUP_SCALE(16, 32); - SETUP_SCALE(32, 24); - SETUP_SCALE(24, 32); - SETUP_SCALE(32, 8); - SETUP_SCALE(8, 32); - SETUP_SCALE(64, 64); - SETUP_SCALE(64, 32); - SETUP_SCALE(32, 64); - SETUP_SCALE(64, 48); - SETUP_SCALE(48, 64); - SETUP_SCALE(64, 16); - SETUP_SCALE(16, 64); -#undef SETUP_SCALE -} - /* radius 2 hexagon. repeated entries are to avoid having to compute mod6 every time. */ const MV hex2[8] = { MV(-1, -2), MV(-2, 0), MV(-1, 2), MV(1, 2), MV(2, 0), MV(1, -2), MV(-1, -2), MV(-2, 0) }; const uint8_t mod6m1[8] = { 5, 0, 1, 2, 3, 4, 5, 0 }; /* (x-1)%6 */ @@ -136,20 +104,57 @@ absPartIdx = -1; searchMethod = X265_HEX_SEARCH; subpelRefine = 2; + blockwidth = blockheight = 0; + blockOffset = 0; bChromaSATD = false; chromaSatd = NULL; } void MotionEstimate::init(int method, int refine, int csp) { - if (!sizeScale[0]) - initScales(); - searchMethod = method; subpelRefine = refine; fencPUYuv.create(FENC_STRIDE, csp); } +void MotionEstimate::initScales(void) +{ +#define SETUP_SCALE(W, H) \ + sizeScale[LUMA_ ## W ## x ## H] = (H * H) >> 4; + SETUP_SCALE(4, 4); + SETUP_SCALE(8, 8); + SETUP_SCALE(8, 4); + SETUP_SCALE(4, 8); + SETUP_SCALE(16, 16); + SETUP_SCALE(16, 8); + SETUP_SCALE(8, 16); + SETUP_SCALE(16, 12); + SETUP_SCALE(12, 16); + SETUP_SCALE(4, 16); + SETUP_SCALE(16, 4); + SETUP_SCALE(32, 32); + SETUP_SCALE(32, 16); + SETUP_SCALE(16, 32); + SETUP_SCALE(32, 24); + SETUP_SCALE(24, 32); + SETUP_SCALE(32, 8); + SETUP_SCALE(8, 32); + SETUP_SCALE(64, 64); + 
SETUP_SCALE(64, 32); + SETUP_SCALE(32, 64); + SETUP_SCALE(64, 48); + SETUP_SCALE(48, 64); + SETUP_SCALE(64, 16); + SETUP_SCALE(16, 64); +#undef SETUP_SCALE +} + +int MotionEstimate::hpelIterationCount(int subme) +{ + return workload[subme].hpel_iters + + workload[subme].qpel_iters / 2; +} + MotionEstimate::~MotionEstimate() { fencPUYuv.destroy();
x265_1.5.tar.gz/source/encoder/motion.h -> x265_1.6.tar.gz/source/encoder/motion.h
Changed
@@ -67,6 +67,8 @@ MotionEstimate(); ~MotionEstimate(); + static void initScales(); + static int hpelIterationCount(int subme); void init(int method, int refine, int csp); /* Methods called at slice setup */
x265_1.5.tar.gz/source/encoder/nal.cpp -> x265_1.6.tar.gz/source/encoder/nal.cpp
Changed
@@ -107,7 +107,7 @@ * nuh_reserved_zero_6bits 6-bits * nuh_temporal_id_plus1 3-bits */ out[bytes++] = (uint8_t)nalUnitType << 1; - out[bytes++] = 1; + out[bytes++] = 1 + (nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N); /* 7.4.1 ... * Within the NAL unit, the following three-byte sequences shall not occur at
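The one-line change above sets the temporal ID field of the second NAL header byte to 2 for TSA_N slices and leaves it at 1 for everything else, which matches the changelog note that `--temporal-layers` codes unreferenced B frames in temporal layer 1. The sketch below packs the same two bytes into a single value so the effect is easy to inspect. `nalHeader` is a hypothetical helper, and the enum lists only two of the NAL unit types, using the numbering from the HEVC specification.

```cpp
#include <cstdint>

// Subset of HEVC NAL unit types (values as numbered in the specification).
enum NalUnitType
{
    NAL_UNIT_CODED_SLICE_TRAIL_R = 1,
    NAL_UNIT_CODED_SLICE_TSA_N   = 2
};

// Builds the two-byte NAL unit header and returns it as (byte0 << 8) | byte1.
// Bit layout: forbidden_zero_bit (1) | nal_unit_type (6) |
//             nuh_reserved_zero_6bits (6) | nuh_temporal_id_plus1 (3)
uint16_t nalHeader(NalUnitType nalUnitType)
{
    // The reserved 6 bits are zero, so byte 0 holds only the type shifted by one.
    uint8_t byte0 = (uint8_t)(nalUnitType << 1);
    // Byte 1 holds only nuh_temporal_id_plus1: 2 for TSA_N, otherwise 1.
    uint8_t byte1 = (uint8_t)(1 + (nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N));
    return (uint16_t)((byte0 << 8) | byte1);
}
```

Under these assumptions a TRAIL_R slice yields the header bytes `0x02 0x01` and a TSA_N slice yields `0x04 0x02`.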
x265_1.5.tar.gz/source/encoder/ratecontrol.cpp -> x265_1.6.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -145,30 +145,6 @@ } } // end anonymous namespace -/* Compute variance to derive AC energy of each block */ -static inline uint32_t acEnergyVar(Frame *curFrame, uint64_t sum_ssd, int shift, int i) -{ - uint32_t sum = (uint32_t)sum_ssd; - uint32_t ssd = (uint32_t)(sum_ssd >> 32); - - curFrame->m_lowres.wp_sum[i] += sum; - curFrame->m_lowres.wp_ssd[i] += ssd; - return ssd - ((uint64_t)sum * sum >> shift); -} - -/* Find the energy of each block in Y/Cb/Cr plane */ -static inline uint32_t acEnergyPlane(Frame *curFrame, pixel* src, intptr_t srcStride, int bChroma, int colorFormat) -{ - if ((colorFormat != X265_CSP_I444) && bChroma) - { - ALIGN_VAR_8(pixel, pix[8 * 8]); - primitives.cu[BLOCK_8x8].copy_pp(pix, 8, src, srcStride); - return acEnergyVar(curFrame, primitives.cu[BLOCK_8x8].var(pix, 8), 6, bChroma); - } - else - return acEnergyVar(curFrame, primitives.cu[BLOCK_16x16].var(src, srcStride), 8, bChroma); -} - /* Returns the zone for the current frame */ x265_zone* RateControl::getZone() { @@ -181,138 +157,9 @@ return NULL; } -/* Find the total AC energy of each block in all planes */ -uint32_t RateControl::acEnergyCu(Frame* curFrame, uint32_t block_x, uint32_t block_y) -{ - intptr_t stride = curFrame->m_fencPic->m_stride; - intptr_t cStride = curFrame->m_fencPic->m_strideC; - intptr_t blockOffsetLuma = block_x + (block_y * stride); - int colorFormat = m_param->internalCsp; - int hShift = CHROMA_H_SHIFT(colorFormat); - int vShift = CHROMA_V_SHIFT(colorFormat); - intptr_t blockOffsetChroma = (block_x >> hShift) + ((block_y >> vShift) * cStride); - - uint32_t var; - - var = acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[0] + blockOffsetLuma, stride, 0, colorFormat); - var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[1] + blockOffsetChroma, cStride, 1, colorFormat); - var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[2] + blockOffsetChroma, cStride, 2, colorFormat); - x265_emms(); - return var; -} - -void 
RateControl::calcAdaptiveQuantFrame(Frame *curFrame) -{ - /* Actual adaptive quantization */ - int maxCol = curFrame->m_fencPic->m_picWidth; - int maxRow = curFrame->m_fencPic->m_picHeight; - - for (int y = 0; y < 3; y++) - { - curFrame->m_lowres.wp_ssd[y] = 0; - curFrame->m_lowres.wp_sum[y] = 0; - } - - /* Calculate Qp offset for each 16x16 block in the frame */ - int block_xy = 0; - int block_x = 0, block_y = 0; - double strength = 0.f; - if (m_param->rc.aqMode == X265_AQ_NONE || m_param->rc.aqStrength == 0) - { - /* Need to init it anyways for CU tree */ - int cuWidth = ((maxCol / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; - int cuHeight = ((maxRow / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; - int cuCount = cuWidth * cuHeight; - - if (m_param->rc.aqMode && m_param->rc.aqStrength == 0) - { - memset(curFrame->m_lowres.qpCuTreeOffset, 0, cuCount * sizeof(double)); - memset(curFrame->m_lowres.qpAqOffset, 0, cuCount * sizeof(double)); - for (int cuxy = 0; cuxy < cuCount; cuxy++) - curFrame->m_lowres.invQscaleFactor[cuxy] = 256; - } - - /* Need variance data for weighted prediction */ - if (m_param->bEnableWeightedPred || m_param->bEnableWeightedBiPred) - { - for (block_y = 0; block_y < maxRow; block_y += 16) - for (block_x = 0; block_x < maxCol; block_x += 16) - acEnergyCu(curFrame, block_x, block_y); - } - } - else - { - block_xy = 0; - double avg_adj_pow2 = 0, avg_adj = 0, qp_adj = 0; - if (m_param->rc.aqMode == X265_AQ_AUTO_VARIANCE) - { - double bit_depth_correction = pow(1 << (X265_DEPTH - 8), 0.5); - for (block_y = 0; block_y < maxRow; block_y += 16) - { - for (block_x = 0; block_x < maxCol; block_x += 16) - { - uint32_t energy = acEnergyCu(curFrame, block_x, block_y); - qp_adj = pow(energy + 1, 0.1); - curFrame->m_lowres.qpCuTreeOffset[block_xy] = qp_adj; - avg_adj += qp_adj; - avg_adj_pow2 += qp_adj * qp_adj; - block_xy++; - } - } - - avg_adj /= m_ncu; - avg_adj_pow2 /= m_ncu; - strength = m_param->rc.aqStrength * avg_adj / 
bit_depth_correction; - avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - (11.f * bit_depth_correction)) / avg_adj; - } - else - strength = m_param->rc.aqStrength * 1.0397f; - - block_xy = 0; - for (block_y = 0; block_y < maxRow; block_y += 16) - { - for (block_x = 0; block_x < maxCol; block_x += 16) - { - if (m_param->rc.aqMode == X265_AQ_AUTO_VARIANCE) - { - qp_adj = curFrame->m_lowres.qpCuTreeOffset[block_xy]; - qp_adj = strength * (qp_adj - avg_adj); - } - else - { - uint32_t energy = acEnergyCu(curFrame, block_x, block_y); - qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (14.427f + 2 * (X265_DEPTH - 8))); - } - curFrame->m_lowres.qpAqOffset[block_xy] = qp_adj; - curFrame->m_lowres.qpCuTreeOffset[block_xy] = qp_adj; - curFrame->m_lowres.invQscaleFactor[block_xy] = x265_exp2fix8(qp_adj); - block_xy++; - } - } - } - - if (m_param->bEnableWeightedPred || m_param->bEnableWeightedBiPred) - { - int hShift = CHROMA_H_SHIFT(m_param->internalCsp); - int vShift = CHROMA_V_SHIFT(m_param->internalCsp); - maxCol = ((maxCol + 8) >> 4) << 4; - maxRow = ((maxRow + 8) >> 4) << 4; - int width[3] = { maxCol, maxCol >> hShift, maxCol >> hShift }; - int height[3] = { maxRow, maxRow >> vShift, maxRow >> vShift }; - - for (int i = 0; i < 3; i++) - { - uint64_t sum, ssd; - sum = curFrame->m_lowres.wp_sum[i]; - ssd = curFrame->m_lowres.wp_ssd[i]; - curFrame->m_lowres.wp_ssd[i] = ssd - (sum * sum + (width[i] * height[i]) / 2) / (width[i] * height[i]); - } - } -} - -RateControl::RateControl(x265_param *p) +RateControl::RateControl(x265_param& p) { - m_param = p; + m_param = &p; int lowresCuWidth = ((m_param->sourceWidth / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; int lowresCuHeight = ((m_param->sourceHeight / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; m_ncu = lowresCuWidth * lowresCuHeight; @@ -329,13 +176,11 @@ m_partialResidualCost = 0; m_rateFactorMaxIncrement = 0; m_rateFactorMaxDecrement = 0; - m_fps = m_param->fpsNum / m_param->fpsDenom; + m_fps = 
(double)m_param->fpsNum / m_param->fpsDenom; m_startEndOrder.set(0); m_bTerminated = false; m_finalFrameCount = 0; m_numEntries = 0; - m_amortizeFraction = 0.85; - m_amortizeFrames = 75; if (m_param->rc.rateControlMode == X265_RC_CRF) { m_param->rc.qp = (int)m_param->rc.rfConstant; @@ -371,6 +216,7 @@ m_statFileOut = NULL; m_cutreeStatFileOut = m_cutreeStatFileIn = NULL; m_rce2Pass = NULL; + m_lastBsliceSatdCost = 0; // vbv initialization m_param->rc.vbvBufferSize = x265_clip3(0, 2000000, m_param->rc.vbvBufferSize); @@ -424,11 +270,6 @@ x265_log(m_param, X265_LOG_WARNING, "strict CBR set without CBR mode, ignored\n"); m_param->rc.bStrictCbr = 0; } - if (m_param->totalFrames <= 2 * m_fps && m_param->rc.bStrictCbr) /* Strict CBR segment encode */
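The `m_fps` change in the constructor hunk above is a plain integer-division fix: without the cast, the quotient is truncated before it is stored in a double. A small sketch of the difference, assuming unsigned integer frame-rate fields as in `x265_param`; the two function names are hypothetical.

```cpp
#include <cstdint>

// Old behaviour: integer division truncates, then the result is widened.
double fpsTruncated(uint32_t fpsNum, uint32_t fpsDenom)
{
    return fpsNum / fpsDenom;
}

// New behaviour: casting one operand first keeps the fractional part.
double fpsExact(uint32_t fpsNum, uint32_t fpsDenom)
{
    return (double)fpsNum / fpsDenom;
}
```

For NTSC-style rates such as 30000/1001 the old expression gives 29 rather than roughly 29.97, so anything derived from `m_fps` would have been slightly off for fractional frame rates.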
x265_1.5.tar.gz/source/encoder/ratecontrol.h -> x265_1.6.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -34,14 +34,16 @@ class Encoder; class Frame; -struct SPS; class SEIBufferingPeriod; +struct SPS; #define BASE_FRAME_DURATION 0.04 /* Arbitrary limitations as a sanity check. */ #define MAX_FRAME_DURATION 1.00 #define MIN_FRAME_DURATION 0.01 +#define MIN_AMORTIZE_FRAME 10 +#define MIN_AMORTIZE_FRACTION 0.2 #define CLIP_DURATION(f) x265_clip3(MIN_FRAME_DURATION, MAX_FRAME_DURATION, f) /* Current frame stats for 2 pass */ @@ -79,46 +81,50 @@ struct RateControlEntry { - int64_t lastSatd; /* Contains the picture cost of the previous frame, required for resetAbr and VBV */ - int sliceType; - int bframes; - int poc; - int encodeOrder; - int64_t leadingNoBSatd; - bool bLastMiniGopBFrame; - double blurredComplexity; - double qpaRc; - double qpAq; - double qRceq; - double frameSizePlanned; /* frame Size decided by RateCotrol before encoding the frame */ - double bufferRate; - double movingAvgSum; - double rowCplxrSum; - int64_t rowTotalBits; /* update cplxrsum and totalbits at the end of 2 rows */ - double qpNoVbv; - double bufferFill; - double frameDuration; - double clippedDuration; - Predictor rowPreds[3][2]; + Predictor rowPreds[3][2]; Predictor* rowPred[2]; - double frameSizeEstimated; /* hold frameSize, updated from cu level vbv rc */ - double frameSizeMaximum; /* max frame Size according to minCR restrictions and level of the video */ - bool isActive; - SEIPictureTiming *picTimingSEI; - HRDTiming *hrdTiming; + + int64_t lastSatd; /* Contains the picture cost of the previous frame, required for resetAbr and VBV */ + int64_t leadingNoBSatd; + int64_t rowTotalBits; /* update cplxrsum and totalbits at the end of 2 rows */ + double blurredComplexity; + double qpaRc; + double qpAq; + double qRceq; + double frameSizePlanned; /* frame Size decided by RateCotrol before encoding the frame */ + double bufferRate; + double movingAvgSum; + double rowCplxrSum; + double qpNoVbv; + double bufferFill; + double frameDuration; + double clippedDuration; + double frameSizeEstimated; /* 
hold frameSize, updated from cu level vbv rc */ + double frameSizeMaximum; /* max frame Size according to minCR restrictions and level of the video */ + int sliceType; + int bframes; + int poc; + int encodeOrder; + bool bLastMiniGopBFrame; + bool isActive; + double amortizeFrames; + double amortizeFraction; /* Required in 2-pass rate control */ - double iCuCount; - double pCuCount; - double skipCuCount; - bool keptAsRef; - double expectedVbv; - double qScale; - double newQScale; - double newQp; - int mvBits; - int miscBits; - int coeffBits; uint64_t expectedBits; /* total expected bits up to the current frame (current one excluded) */ + double iCuCount; + double pCuCount; + double skipCuCount; + double expectedVbv; + double qScale; + double newQScale; + double newQp; + int mvBits; + int miscBits; + int coeffBits; + bool keptAsRef; + + SEIPictureTiming *picTimingSEI; + HRDTiming *hrdTiming; }; class RateControl @@ -139,7 +145,7 @@ bool m_isAbrReset; int m_lastAbrResetPoc; - double m_rateTolerance; + double m_rateTolerance; double m_frameDuration; /* current frame duration in seconds */ double m_bitrate; double m_rateFactorConstant; @@ -154,33 +160,38 @@ Predictor m_pred[5]; Predictor m_predBfromP; - int m_leadingBframes; - int64_t m_bframeBits; - int64_t m_currentSatd; - int m_qpConstant[3]; - double m_ipOffset; - double m_pbOffset; - - int m_lastNonBPictType; - int64_t m_leadingNoBSatd; - - double m_cplxrSum; /* sum of bits*qscale/rceq */ - double m_wantedBitsWindow; /* target bitrate * window */ - double m_accumPQp; /* for determining I-frame quant */ - double m_accumPNorm; - double m_lastQScaleFor[3]; /* last qscale for a specific pict type, used for max_diff & ipb factor stuff */ - double m_lstep; - double m_shortTermCplxSum; - double m_shortTermCplxCount; - double m_lastRceq; - double m_qCompress; - int64_t m_totalBits; /* total bits used for already encoded frames (after ammortization) */ - int m_framesDone; /* # of frames passed through RateCotrol already */ 
- int64_t m_encodedBits; /* bits used for encoded frames (without ammortization) */ - double m_fps; - int64_t m_satdCostWindow[50]; - int m_sliderPos; - int64_t m_encodedBitsWindow[50]; + int64_t m_leadingNoBSatd; + double m_ipOffset; + double m_pbOffset; + int64_t m_bframeBits; + int64_t m_currentSatd; + int m_leadingBframes; + int m_qpConstant[3]; + int m_lastNonBPictType; + int m_framesDone; /* # of frames passed through RateCotrol already */ + + double m_cplxrSum; /* sum of bits*qscale/rceq */ + double m_wantedBitsWindow; /* target bitrate * window */ + double m_accumPQp; /* for determining I-frame quant */ + double m_accumPNorm; + double m_lastQScaleFor[3]; /* last qscale for a specific pict type, used for max_diff & ipb factor stuff */ + double m_lstep; + double m_shortTermCplxSum; + double m_shortTermCplxCount; + double m_lastRceq; + double m_qCompress; + int64_t m_totalBits; /* total bits used for already encoded frames (after ammortization) */ + int64_t m_encodedBits; /* bits used for encoded frames (without ammortization) */ + double m_fps; + int64_t m_satdCostWindow[50]; + int64_t m_encodedBitsWindow[50]; + int m_sliderPos; + + /* To detect a pattern of low detailed static frames in single pass ABR using satdcosts */ + int64_t m_lastBsliceSatdCost; + int m_numBframesInPattern; + bool m_isPatternPresent; + /* a common variable on which rateControlStart, rateControlEnd and rateControUpdateStats waits to * sync the calls to these functions. 
For example * -F2: @@ -194,24 +205,25 @@ * rceUpdate 12 * rceEnd 11 */ ThreadSafeInteger m_startEndOrder; - int m_finalFrameCount; /* set when encoder begins flushing */ - bool m_bTerminated; /* set true when encoder is closing */ + int m_finalFrameCount; /* set when encoder begins flushing */ + bool m_bTerminated; /* set true when encoder is closing */ /* hrd stuff */ SEIBufferingPeriod m_bufPeriodSEI; - double m_nominalRemovalTime; - double m_prevCpbFinalAT; + double m_nominalRemovalTime; + double m_prevCpbFinalAT; /* 2 pass */ - bool m_2pass; - FILE* m_statFileOut;
x265_1.5.tar.gz/source/encoder/sao.cpp -> x265_1.6.tar.gz/source/encoder/sao.cpp
Changed
@@ -261,6 +261,8 @@ int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1; int8_t _upBufft[MAX_CU_SIZE + 2], *upBufft = _upBufft + 1; + memset(_upBuff1 + MAX_CU_SIZE, 0, 2 * sizeof(int8_t)); /* avoid valgrind uninit warnings */ + { const pixel* recR = &rec[ctuWidth - 1]; for (int i = 0; i < ctuHeight + 1; i++)
x265_1.5.tar.gz/source/encoder/search.cpp -> x265_1.6.tar.gz/source/encoder/search.cpp
Changed
@@ -30,6 +30,9 @@ #include "entropy.h" #include "rdcost.h" +#include "analysis.h" // TLD +#include "framedata.h" + using namespace x265; #if _MSC_VER @@ -40,10 +43,9 @@ #define MVP_IDX_BITS 1 -ALIGN_VAR_32(const pixel, Search::zeroPixel[MAX_CU_SIZE]) = { 0 }; ALIGN_VAR_32(const int16_t, Search::zeroShort[MAX_CU_SIZE]) = { 0 }; -Search::Search() : JobProvider(NULL) +Search::Search() { memset(m_rqt, 0, sizeof(m_rqt)); @@ -54,25 +56,30 @@ } m_numLayers = 0; + m_intraPred = NULL; + m_intraPredAngs = NULL; + m_fencScaled = NULL; + m_fencTransposed = NULL; + m_tsCoeff = NULL; + m_tsResidual = NULL; + m_tsRecon = NULL; m_param = NULL; m_slice = NULL; m_frame = NULL; - m_bJobsQueued = false; - m_totalNumME = m_numAcquiredME = m_numCompletedME = 0; } bool Search::initSearch(const x265_param& param, ScalingList& scalingList) { uint32_t maxLog2CUSize = g_log2Size[param.maxCUSize]; m_param = &param; - m_bEnableRDOQ = param.rdLevel >= 4; + m_bEnableRDOQ = !!param.rdoqLevel; m_bFrameParallel = param.frameNumThreads > 1; m_numLayers = g_log2Size[param.maxCUSize] - 2; m_rdCost.setPsyRdScale(param.psyRd); m_me.init(param.searchMethod, param.subpelRefine, param.internalCsp); - bool ok = m_quant.init(m_bEnableRDOQ, param.psyRdoq, scalingList, m_entropyCoder); + bool ok = m_quant.init(param.rdoqLevel, param.psyRdoq, scalingList, m_entropyCoder); if (m_param->noiseReductionIntra || m_param->noiseReductionInter) ok &= m_quant.allocNoiseReduction(param); @@ -116,6 +123,15 @@ m_qtTempTransformSkipFlag[1] = m_qtTempTransformSkipFlag[0] + numPartitions; m_qtTempTransformSkipFlag[2] = m_qtTempTransformSkipFlag[0] + numPartitions * 2; + CHECKED_MALLOC(m_intraPred, pixel, (32 * 32) * (33 + 3)); + m_fencScaled = m_intraPred + 32 * 32; + m_fencTransposed = m_fencScaled + 32 * 32; + m_intraPredAngs = m_fencTransposed + 32 * 32; + + CHECKED_MALLOC(m_tsCoeff, coeff_t, MAX_TS_SIZE * MAX_TS_SIZE); + CHECKED_MALLOC(m_tsResidual, int16_t, MAX_TS_SIZE * MAX_TS_SIZE); + CHECKED_MALLOC(m_tsRecon, pixel,
MAX_TS_SIZE * MAX_TS_SIZE); + return ok; fail: @@ -141,6 +157,10 @@ X265_FREE(m_qtTempCbf[0]); X265_FREE(m_qtTempTransformSkipFlag[0]); + X265_FREE(m_intraPred); + X265_FREE(m_tsCoeff); + X265_FREE(m_tsResidual); + X265_FREE(m_tsRecon); } void Search::setQP(const Slice& slice, int qp) @@ -421,7 +441,7 @@ } // set reconstruction for next intra prediction blocks if full TU prediction won - pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.encodeIdx + absPartIdx); + pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); intptr_t picStride = m_frame->m_reconPic->m_stride; primitives.cu[sizeIdx].copy_pp(picReconY, picStride, reconQt, reconQtStride); @@ -477,17 +497,14 @@ if (m_bEnableRDOQ) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true); - ALIGN_VAR_32(coeff_t, tsCoeffY[MAX_TS_SIZE * MAX_TS_SIZE]); - ALIGN_VAR_32(pixel, tsReconY[MAX_TS_SIZE * MAX_TS_SIZE]); - int checkTransformSkip = 1; for (int useTSkip = 0; useTSkip <= checkTransformSkip; useTSkip++) { uint64_t tmpCost; uint32_t tmpEnergy = 0; - coeff_t* coeff = (useTSkip ? tsCoeffY : coeffY); - pixel* tmpRecon = (useTSkip ? tsReconY : reconQt); + coeff_t* coeff = (useTSkip ? m_tsCoeff : coeffY); + pixel* tmpRecon = (useTSkip ? m_tsRecon : reconQt); uint32_t tmpReconStride = (useTSkip ? 
MAX_TS_SIZE : reconQtStride); primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride); @@ -578,8 +595,8 @@ if (bTSkip) { - memcpy(coeffY, tsCoeffY, sizeof(coeff_t) << (log2TrSize * 2)); - primitives.cu[sizeIdx].copy_pp(reconQt, reconQtStride, tsReconY, tuSize); + memcpy(coeffY, m_tsCoeff, sizeof(coeff_t) << (log2TrSize * 2)); + primitives.cu[sizeIdx].copy_pp(reconQt, reconQtStride, m_tsRecon, tuSize); } else if (checkTransformSkip) { @@ -589,7 +606,7 @@ } // set reconstruction for next intra prediction blocks - pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.encodeIdx + absPartIdx); + pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); intptr_t picStride = m_frame->m_reconPic->m_stride; primitives.cu[sizeIdx].copy_pp(picReconY, picStride, reconQt, reconQtStride); @@ -639,7 +656,7 @@ uint32_t sizeIdx = log2TrSize - 2; primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride); - pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.encodeIdx + absPartIdx); + pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); intptr_t picStride = m_frame->m_reconPic->m_stride; uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffY, log2TrSize, TEXT_LUMA, absPartIdx, false); @@ -799,7 +816,7 @@ coeff_t* coeffC = m_rqt[qtLayer].coeffRQT[chromaId] + coeffOffsetC; pixel* reconQt = m_rqt[qtLayer].reconQtYuv.getChromaAddr(chromaId, absPartIdxC); uint32_t reconQtStride = m_rqt[qtLayer].reconQtYuv.m_csize; - pixel* picReconC = m_frame->m_reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.encodeIdx + absPartIdxC); + pixel* picReconC = m_frame->m_reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + absPartIdxC); intptr_t picStride = m_frame->m_reconPic->m_strideC; uint32_t chromaPredMode = cu.m_chromaIntraDir[absPartIdxC]; @@ -812,7 +829,7 @@ initAdiPatternChroma(cu, cuGeom, absPartIdxC, 
intraNeighbors, chromaId); // get prediction signal - predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC, m_csp); + predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC); cu.setTransformSkipPartRange(0, ttype, absPartIdxC, tuIterator.absPartIdxStep); primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride); @@ -864,9 +881,6 @@ * condition as it arrived, and to do all bit estimates from the same state. */ m_entropyCoder.store(m_rqt[fullDepth].rqtRoot); - ALIGN_VAR_32(coeff_t, tskipCoeffC[MAX_TS_SIZE * MAX_TS_SIZE]); - ALIGN_VAR_32(pixel, tskipReconC[MAX_TS_SIZE * MAX_TS_SIZE]); - uint32_t curPartNum = cuGeom.numPartitions >> tuDepthC * 2; const SplitType splitType = (m_csp == X265_CSP_I422) ? VERTICAL_SPLIT : DONT_SPLIT; @@ -903,7 +917,7 @@ chromaPredMode = g_chroma422IntraAngleMappingTable[chromaPredMode]; // get prediction signal - predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC, m_csp); + predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC); uint64_t bCost = MAX_INT64; uint32_t bDist = 0; @@ -914,8 +928,8 @@ int checkTransformSkip = 1; for (int useTSkip = 0; useTSkip <= checkTransformSkip; useTSkip++) { - coeff_t* coeff = (useTSkip ? tskipCoeffC : coeffC); - pixel* recon = (useTSkip ? tskipReconC : reconQt); + coeff_t* coeff = (useTSkip ? m_tsCoeff : coeffC); + pixel* recon = (useTSkip ? m_tsRecon : reconQt); uint32_t reconStride = (useTSkip ? MAX_TS_SIZE : reconQtStride); primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride); @@ -972,14 +986,14 @@ if (bTSkip) { - memcpy(coeffC, tskipCoeffC, sizeof(coeff_t) << (log2TrSizeC * 2)); - primitives.cu[sizeIdxC].copy_pp(reconQt, reconQtStride, tskipReconC, MAX_TS_SIZE); + memcpy(coeffC, m_tsCoeff, sizeof(coeff_t) << (log2TrSizeC * 2)); + primitives.cu[sizeIdxC].copy_pp(reconQt, reconQtStride, m_tsRecon, MAX_TS_SIZE); }
x265_1.5.tar.gz/source/encoder/search.h -> x265_1.6.tar.gz/source/encoder/search.h
Changed
@@ -28,6 +28,7 @@ #include "predict.h" #include "quant.h" #include "bitcost.h" +#include "framedata.h" #include "yuv.h" #include "threadpool.h" @@ -35,6 +36,18 @@ #include "entropy.h" #include "motion.h" +#if DETAILED_CU_STATS +#define ProfileCUScopeNamed(name, cu, acc, count) \ + m_stats[cu.m_encData->m_frameEncoderID].count++; \ + ScopedElapsedTime name(m_stats[cu.m_encData->m_frameEncoderID].acc) +#define ProfileCUScope(cu, acc, count) ProfileCUScopeNamed(timedScope, cu, acc, count) +#define ProfileCounter(cu, count) m_stats[cu.m_encData->m_frameEncoderID].count++; +#else +#define ProfileCUScopeNamed(name, cu, acc, count) +#define ProfileCUScope(cu, acc, count) +#define ProfileCounter(cu, count) +#endif + namespace x265 { // private namespace @@ -88,6 +101,10 @@ MotionData bestME[MAX_INTER_PARTS][2]; MV amvpCand[2][MAX_NUM_REF][AMVP_NUM_CANDS]; + // Neighbour MVs of the current partition. 5 spatial candidates and the + // temporal candidate. + InterNeighbourMV interNeighbours[6]; + uint64_t rdCost; // sum of partition (psy) RD costs (sse(fenc, recon) + lambda2 * bits) uint64_t sa8dCost; // sum of partition sa8d distortion costs (sa8d(fenc, pred) + lambda * bits) uint32_t sa8dBits; // signal bits used in sa8dCost calculation @@ -109,8 +126,35 @@ coeffBits = 0; } + void invalidate() + { + /* set costs to invalid data, catch uninitialized re-use */ + rdCost = UINT64_MAX / 2; + sa8dCost = UINT64_MAX / 2; + sa8dBits = MAX_UINT / 2; + psyEnergy = MAX_UINT / 2; + distortion = MAX_UINT / 2; + totalBits = MAX_UINT / 2; + mvBits = MAX_UINT / 2; + coeffBits = MAX_UINT / 2; + } + + bool ok() const + { + return !(rdCost >= UINT64_MAX / 2 || + sa8dCost >= UINT64_MAX / 2 || + sa8dBits >= MAX_UINT / 2 || + psyEnergy >= MAX_UINT / 2 || + distortion >= MAX_UINT / 2 || + totalBits >= MAX_UINT / 2 || + mvBits >= MAX_UINT / 2 || + coeffBits >= MAX_UINT / 2); + } + void addSubCosts(const Mode& subMode) { + X265_CHECK(subMode.ok(), "sub-mode not initialized"); + rdCost += 
subMode.rdCost; sa8dCost += subMode.sa8dCost; sa8dBits += subMode.sa8dBits; @@ -122,16 +166,89 @@ } }; +#if DETAILED_CU_STATS +/* This structure is intended for performance debugging and we make no attempt + * to handle dynamic range overflows. Care should be taken to avoid long encodes + * if you care about the accuracy of these elapsed times and counters. This + * profiling is orthogonal to PPA/VTune and can be enabled independently from + * either of them */ +struct CUStats +{ + int64_t intraRDOElapsedTime[NUM_CU_DEPTH]; // elapsed worker time in intra RDO per CU depth + int64_t interRDOElapsedTime[NUM_CU_DEPTH]; // elapsed worker time in inter RDO per CU depth + int64_t intraAnalysisElapsedTime; // elapsed worker time in intra sa8d analysis + int64_t motionEstimationElapsedTime; // elapsed worker time in predInterSearch() + int64_t loopFilterElapsedTime; // elapsed worker time in deblock and SAO and PSNR/SSIM + int64_t pmeTime; // elapsed worker time processing ME slave jobs + int64_t pmeBlockTime; // elapsed worker time blocked for pme batch completion + int64_t pmodeTime; // elapsed worker time processing pmode slave jobs + int64_t pmodeBlockTime; // elapsed worker time blocked for pmode batch completion + int64_t weightAnalyzeTime; // elapsed worker time analyzing reference weights + int64_t totalCTUTime; // elapsed worker time in compressCTU (includes pmode master) + + uint64_t countIntraRDO[NUM_CU_DEPTH]; + uint64_t countInterRDO[NUM_CU_DEPTH]; + uint64_t countIntraAnalysis; + uint64_t countMotionEstimate; + uint64_t countLoopFilter; + uint64_t countPMETasks; + uint64_t countPMEMasters; + uint64_t countPModeTasks; + uint64_t countPModeMasters; + uint64_t countWeightAnalyze; + uint64_t totalCTUs; + + CUStats() { clear(); } + + void clear() + { + memset(this, 0, sizeof(*this)); + } + + void accumulate(CUStats& other) + { + for (uint32_t i = 0; i <= g_maxCUDepth; i++) + { + intraRDOElapsedTime[i] += other.intraRDOElapsedTime[i]; + interRDOElapsedTime[i] += 
other.interRDOElapsedTime[i]; + countIntraRDO[i] += other.countIntraRDO[i]; + countInterRDO[i] += other.countInterRDO[i]; + } + + intraAnalysisElapsedTime += other.intraAnalysisElapsedTime; + motionEstimationElapsedTime += other.motionEstimationElapsedTime; + loopFilterElapsedTime += other.loopFilterElapsedTime; + pmeTime += other.pmeTime; + pmeBlockTime += other.pmeBlockTime; + pmodeTime += other.pmodeTime; + pmodeBlockTime += other.pmodeBlockTime; + weightAnalyzeTime += other.weightAnalyzeTime; + totalCTUTime += other.totalCTUTime; + + countIntraAnalysis += other.countIntraAnalysis; + countMotionEstimate += other.countMotionEstimate; + countLoopFilter += other.countLoopFilter; + countPMETasks += other.countPMETasks; + countPMEMasters += other.countPMEMasters; + countPModeTasks += other.countPModeTasks; + countPModeMasters += other.countPModeMasters; + countWeightAnalyze += other.countWeightAnalyze; + totalCTUs += other.totalCTUs; + + other.clear(); + } +}; +#endif + inline int getTUBits(int idx, int numIdx) { return idx + (idx < numIdx - 1); } -class Search : public JobProvider, public Predict +class Search : public Predict { public: - static const pixel zeroPixel[MAX_CU_SIZE]; static const int16_t zeroShort[MAX_CU_SIZE]; MotionEstimate m_me; @@ -147,11 +264,25 @@ uint8_t* m_qtTempCbf[3]; uint8_t* m_qtTempTransformSkipFlag[3]; + pixel* m_fencScaled; /* 32x32 buffer for down-scaled version of 64x64 CU fenc */ + pixel* m_fencTransposed; /* 32x32 buffer for transposed copy of fenc */ + pixel* m_intraPred; /* 32x32 buffer for individual intra predictions */ + pixel* m_intraPredAngs; /* allocation for 33 consecutive (all angular) 32x32 intra predictions */ + + coeff_t* m_tsCoeff; /* transform skip coeff 32x32 */ + int16_t* m_tsResidual; /* transform skip residual 32x32 */ + pixel* m_tsRecon; /* transform skip reconstructed pixels 32x32 */ + bool m_bFrameParallel; bool m_bEnableRDOQ; uint32_t m_numLayers; uint32_t m_refLagPixels; +#if DETAILED_CU_STATS + /* Accumulate 
CU statistics separately for each frame encoder */ + CUStats m_stats[X265_MAX_FRAME_THREADS]; +#endif + Search(); ~Search(); @@ -162,7 +293,7 @@ void invalidateContexts(int fromDepth); // full RD search of intra modes. if sharedModes is not NULL, it directly uses them - void checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSize, uint8_t* sharedModes); + void checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSize, uint8_t* sharedModes, uint8_t* sharedChromaModes); // select best intra mode using only sa8d costs, cannot measure NxN intra
x265_1.5.tar.gz/source/encoder/slicetype.cpp -> x265_1.6.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -34,11 +34,17 @@ #include "motion.h" #include "ratecontrol.h" -#define NUM_CUS (m_widthInCU > 2 && m_heightInCU > 2 ? (m_widthInCU - 2) * (m_heightInCU - 2) : m_widthInCU * m_heightInCU) +#if DETAILED_CU_STATS +#define ProfileLookaheadTime(elapsed, count) ScopedElapsedTime _scope(elapsed); count++ +#else +#define ProfileLookaheadTime(elapsed, count) +#endif using namespace x265; -static inline int16_t median(int16_t a, int16_t b, int16_t c) +namespace { + +inline int16_t median(int16_t a, int16_t b, int16_t c) { int16_t t = (a - b) & ((a - b) >> 31); @@ -49,55 +55,531 @@ return b; } -static inline void median_mv(MV &dst, MV a, MV b, MV c) +inline void median_mv(MV &dst, MV a, MV b, MV c) { dst.x = median(a.x, b.x, c.x); dst.y = median(a.y, b.y, c.y); } +/* Compute variance to derive AC energy of each block */ +inline uint32_t acEnergyVar(Frame *curFrame, uint64_t sum_ssd, int shift, int plane) +{ + uint32_t sum = (uint32_t)sum_ssd; + uint32_t ssd = (uint32_t)(sum_ssd >> 32); + + curFrame->m_lowres.wp_sum[plane] += sum; + curFrame->m_lowres.wp_ssd[plane] += ssd; + return ssd - ((uint64_t)sum * sum >> shift); +} + +/* Find the energy of each block in Y/Cb/Cr plane */ +inline uint32_t acEnergyPlane(Frame *curFrame, pixel* src, intptr_t srcStride, int plane, int colorFormat) +{ + if ((colorFormat != X265_CSP_I444) && plane) + { + ALIGN_VAR_8(pixel, pix[8 * 8]); + primitives.cu[BLOCK_8x8].copy_pp(pix, 8, src, srcStride); + return acEnergyVar(curFrame, primitives.cu[BLOCK_8x8].var(pix, 8), 6, plane); + } + else + return acEnergyVar(curFrame, primitives.cu[BLOCK_16x16].var(src, srcStride), 8, plane); +} + +} // end anonymous namespace + +/* Find the total AC energy of each block in all planes */ +uint32_t LookaheadTLD::acEnergyCu(Frame* curFrame, uint32_t blockX, uint32_t blockY, int csp) +{ + intptr_t stride = curFrame->m_fencPic->m_stride; + intptr_t cStride = curFrame->m_fencPic->m_strideC; + intptr_t blockOffsetLuma = blockX + (blockY * stride); + int hShift = 
CHROMA_H_SHIFT(csp); + int vShift = CHROMA_V_SHIFT(csp); + intptr_t blockOffsetChroma = (blockX >> hShift) + ((blockY >> vShift) * cStride); + + uint32_t var; + + var = acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[0] + blockOffsetLuma, stride, 0, csp); + var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[1] + blockOffsetChroma, cStride, 1, csp); + var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[2] + blockOffsetChroma, cStride, 2, csp); + x265_emms(); + return var; +} + +void LookaheadTLD::calcAdaptiveQuantFrame(Frame *curFrame, x265_param* param) +{ + /* Actual adaptive quantization */ + int maxCol = curFrame->m_fencPic->m_picWidth; + int maxRow = curFrame->m_fencPic->m_picHeight; + int blockWidth = ((param->sourceWidth / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; + int blockHeight = ((param->sourceHeight / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; + int blockCount = blockWidth * blockHeight; + + for (int y = 0; y < 3; y++) + { + curFrame->m_lowres.wp_ssd[y] = 0; + curFrame->m_lowres.wp_sum[y] = 0; + } + + /* Calculate Qp offset for each 16x16 block in the frame */ + int blockXY = 0; + int blockX = 0, blockY = 0; + double strength = 0.f; + if (param->rc.aqMode == X265_AQ_NONE || param->rc.aqStrength == 0) + { + /* Need to init it anyways for CU tree */ + int cuCount = widthInCU * heightInCU; + + if (param->rc.aqMode && param->rc.aqStrength == 0) + { + memset(curFrame->m_lowres.qpCuTreeOffset, 0, cuCount * sizeof(double)); + memset(curFrame->m_lowres.qpAqOffset, 0, cuCount * sizeof(double)); + for (int cuxy = 0; cuxy < cuCount; cuxy++) + curFrame->m_lowres.invQscaleFactor[cuxy] = 256; + } + + /* Need variance data for weighted prediction */ + if (param->bEnableWeightedPred || param->bEnableWeightedBiPred) + { + for (blockY = 0; blockY < maxRow; blockY += 16) + for (blockX = 0; blockX < maxCol; blockX += 16) + acEnergyCu(curFrame, blockX, blockY, param->internalCsp); + } + } + else + { + blockXY = 0; + double 
avg_adj_pow2 = 0, avg_adj = 0, qp_adj = 0; + if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE) + { + double bit_depth_correction = pow(1 << (X265_DEPTH - 8), 0.5); + for (blockY = 0; blockY < maxRow; blockY += 16) + { + for (blockX = 0; blockX < maxCol; blockX += 16) + { + uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp); + qp_adj = pow(energy + 1, 0.1); + curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj; + avg_adj += qp_adj; + avg_adj_pow2 += qp_adj * qp_adj; + blockXY++; + } + } + + avg_adj /= blockCount; + avg_adj_pow2 /= blockCount; + strength = param->rc.aqStrength * avg_adj / bit_depth_correction; + avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - (11.f * bit_depth_correction)) / avg_adj; + } + else + strength = param->rc.aqStrength * 1.0397f; + + blockXY = 0; + for (blockY = 0; blockY < maxRow; blockY += 16) + { + for (blockX = 0; blockX < maxCol; blockX += 16) + { + if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE) + { + qp_adj = curFrame->m_lowres.qpCuTreeOffset[blockXY]; + qp_adj = strength * (qp_adj - avg_adj); + } + else + { + uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp); + qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (14.427f + 2 * (X265_DEPTH - 8))); + } + curFrame->m_lowres.qpAqOffset[blockXY] = qp_adj; + curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj; + curFrame->m_lowres.invQscaleFactor[blockXY] = x265_exp2fix8(qp_adj); + blockXY++; + } + } + } + + if (param->bEnableWeightedPred || param->bEnableWeightedBiPred) + { + int hShift = CHROMA_H_SHIFT(param->internalCsp); + int vShift = CHROMA_V_SHIFT(param->internalCsp); + maxCol = ((maxCol + 8) >> 4) << 4; + maxRow = ((maxRow + 8) >> 4) << 4; + int width[3] = { maxCol, maxCol >> hShift, maxCol >> hShift }; + int height[3] = { maxRow, maxRow >> vShift, maxRow >> vShift }; + + for (int i = 0; i < 3; i++) + { + uint64_t sum, ssd; + sum = curFrame->m_lowres.wp_sum[i]; + ssd = curFrame->m_lowres.wp_ssd[i]; + curFrame->m_lowres.wp_ssd[i] = ssd - 
(sum * sum + (width[i] * height[i]) / 2) / (width[i] * height[i]); + } + } +} + +void LookaheadTLD::lowresIntraEstimate(Lowres& fenc) +{ + ALIGN_VAR_32(pixel, prediction[X265_LOWRES_CU_SIZE * X265_LOWRES_CU_SIZE]); + pixel fencIntra[X265_LOWRES_CU_SIZE * X265_LOWRES_CU_SIZE]; + pixel neighbours[2][X265_LOWRES_CU_SIZE * 4 + 1]; + pixel* samples = neighbours[0], *filtered = neighbours[1]; + + const int lookAheadLambda = (int)x265_lambda_tab[X265_LOOKAHEAD_QP]; + const int intraPenalty = 5 * lookAheadLambda; + const int lowresPenalty = 4; /* fixed CU cost overhead */ + + const int cuSize = X265_LOWRES_CU_SIZE; + const int cuSize2 = cuSize << 1; + const int sizeIdx = X265_LOWRES_CU_BITS - 2;
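The acEnergyVar helper added in this hunk unpacks a 64-bit value into a pixel sum (low 32 bits) and a sum of squares (high 32 bits), then returns ssd - (sum * sum >> shift), where shift is log2 of the pixel count (6 for 8x8, 8 for 16x16). That quantity is the sum of squared deviations from the block mean, i.e. the pixel count times the variance. A minimal sketch of the same arithmetic follows; packedSumSsd is a hypothetical stand-in for the primitives.cu[...].var primitive, and the per-frame wp_sum/wp_ssd accumulation is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for primitives.cu[...].var (not the x265 primitive):
// returns the pixel sum in the low 32 bits and the sum of squares in the high 32 bits.
uint64_t packedSumSsd(const uint8_t* pix, int stride, int size)
{
    assert(size > 0 && stride >= size);
    uint32_t sum = 0, ssd = 0;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
        {
            uint32_t v = pix[y * stride + x];
            sum += v;
            ssd += v * v;
        }
    }
    return sum + ((uint64_t)ssd << 32);
}

// Same arithmetic as acEnergyVar in the hunk, minus the frame accumulators.
// shift must be log2(number of pixels in the block).
uint32_t acEnergy(uint64_t sumSsd, int shift)
{
    uint32_t sum = (uint32_t)sumSsd;
    uint32_t ssd = (uint32_t)(sumSsd >> 32);
    return ssd - (uint32_t)(((uint64_t)sum * sum) >> shift);
}
```

A flat block has no AC energy, so acEnergy(packedSumSsd(block, 8, 8), 6) is 0 for an 8x8 block of identical pixels; any texture makes the result grow.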
x265_1.5.tar.gz/source/encoder/slicetype.h -> x265_1.6.tar.gz/source/encoder/slicetype.h
Changed
@@ -28,141 +28,135 @@ #include "slice.h" #include "motion.h" #include "piclist.h" -#include "wavefront.h" +#include "threadpool.h" namespace x265 { // private namespace struct Lowres; class Frame; +class Lookahead; #define LOWRES_COST_MASK ((1 << 14) - 1) #define LOWRES_COST_SHIFT 14 -#define SET_WEIGHT(w, b, s, d, o) \ - { \ - (w).inputWeight = (s); \ - (w).log2WeightDenom = (d); \ - (w).inputOffset = (o); \ - (w).bPresentFlag = b; \ - } - -class EstimateRow +/* Thread local data for lookahead tasks */ +struct LookaheadTLD { -public: - x265_param* m_param; - MotionEstimate m_me; - Lock m_lock; - - volatile uint32_t m_completed; // Number of CUs in this row for which cost estimation is completed - volatile bool m_active; - - uint64_t m_costEst; // Estimated cost for all CUs in a row - uint64_t m_costEstAq; // Estimated weight Aq cost for all CUs in a row - uint64_t m_costIntraAq; // Estimated weighted Aq Intra cost for all CUs in a row - int m_intraMbs; // Number of Intra CUs - int m_costIntra; // Estimated Intra cost for all CUs in a row - - int m_merange; - int m_lookAheadLambda; - - int m_widthInCU; - int m_heightInCU; - - EstimateRow() + MotionEstimate me; + ReferencePlanes weightedRef; + pixel* wbuffer[4]; + int widthInCU; + int heightInCU; + int ncu; + int paddedLines; + +#if DETAILED_CU_STATS + int64_t batchElapsedTime; + int64_t coopSliceElapsedTime; + uint64_t countBatches; + uint64_t countCoopSlices; +#endif + + LookaheadTLD() { - m_me.setQP(X265_LOOKAHEAD_QP); - m_me.init(X265_HEX_SEARCH, 1, X265_CSP_I400); - m_merange = 16; - m_lookAheadLambda = (int)x265_lambda_tab[X265_LOOKAHEAD_QP]; + me.setQP(X265_LOOKAHEAD_QP); + me.init(X265_HEX_SEARCH, 1, X265_CSP_I400); + for (int i = 0; i < 4; i++) + wbuffer[i] = NULL; + widthInCU = heightInCU = ncu = paddedLines = 0; + +#if DETAILED_CU_STATS + batchElapsedTime = 0; + coopSliceElapsedTime = 0; + countBatches = 0; + countCoopSlices = 0; +#endif } - void init(); - - void estimateCUCost(Lowres * *frames, 
ReferencePlanes * wfref0, int cux, int cuy, int p0, int p1, int b, bool bDoSearch[2]); -}; - -/* CostEstimate manages the cost estimation of a single frame, ie: - * estimateFrameCost() and everything below it in the call graph */ -class CostEstimate : public WaveFront -{ -public: - CostEstimate(ThreadPool *p); - ~CostEstimate(); - void init(x265_param *, Frame *); - - x265_param *m_param; - EstimateRow *m_rows; - pixel *m_wbuffer[4]; - Lowres **m_curframes; - - ReferencePlanes m_weightedRef; - WeightParam m_w; + void init(int w, int h, int n) + { + widthInCU = w; + heightInCU = h; + ncu = n; + } - int m_paddedLines; // number of lines in padded frame - int m_widthInCU; // width of lowres frame in downscale CUs - int m_heightInCU; // height of lowres frame in downscale CUs + ~LookaheadTLD() { X265_FREE(wbuffer[0]); } - bool m_bDoSearch[2]; - volatile bool m_bFrameCompleted; - int m_curb, m_curp0, m_curp1; + void calcAdaptiveQuantFrame(Frame *curFrame, x265_param* param); + void lowresIntraEstimate(Lowres& fenc); - void processRow(int row, int threadId); - int64_t estimateFrameCost(Lowres **frames, int p0, int p1, int b, bool bIntraPenalty); + void weightsAnalyse(Lowres& fenc, Lowres& ref); protected: - void weightsAnalyse(Lowres **frames, int b, int p0); - uint32_t weightCostLuma(Lowres **frames, int b, int p0, WeightParam *w); + uint32_t acEnergyCu(Frame* curFrame, uint32_t blockX, uint32_t blockY, int csp); + uint32_t weightCostLuma(Lowres& fenc, Lowres& ref, WeightParam& wp); + bool allocWeightedRef(Lowres& fenc); }; class Lookahead : public JobProvider { public: + PicList m_inputQueue; // input pictures in order received + PicList m_outputQueue; // pictures to be encoded, in encode order + Lock m_inputLock; + Lock m_outputLock; + + /* pre-lookahead */ + Frame* m_preframes[X265_LOOKAHEAD_MAX]; + int m_preTotal, m_preAcquired, m_preCompleted; + int m_fullQueueSize; + bool m_isActive; + bool m_sliceTypeBusy; + bool m_bAdaptiveQuant; + bool m_outputSignalRequired; + 
bool m_bBatchMotionSearch; + bool m_bBatchFrameCosts; + Lock m_preLookaheadLock; + Event m_outputSignal; + + LookaheadTLD* m_tld; + x265_param* m_param; + Lowres* m_lastNonB; + int* m_scratch; // temp buffer for cutree propagate + + int m_histogram[X265_BFRAME_MAX + 1]; + int m_lastKeyframe; + int m_8x8Width; + int m_8x8Height; + int m_8x8Blocks; + int m_numCoopSlices; + int m_numRowsPerSlice; + bool m_filled; + Lookahead(x265_param *param, ThreadPool *pool); - ~Lookahead(); - void init(); - void destroy(); - CostEstimate m_est; // Frame cost estimator - PicList m_inputQueue; // input pictures in order received - PicList m_outputQueue; // pictures to be encoded, in encode order +#if DETAILED_CU_STATS + int64_t m_slicetypeDecideElapsedTime; + int64_t m_preLookaheadElapsedTime; + uint64_t m_countSlicetypeDecide; + uint64_t m_countPreLookahead; + void getWorkerStats(int64_t& batchElapsedTime, uint64_t& batchCount, int64_t& coopSliceElapsedTime, uint64_t& coopSliceCount); +#endif - x265_param *m_param; - Lowres *m_lastNonB; - int *m_scratch; // temp buffer + bool create(); + void destroy(); + void stop(); - int m_widthInCU; // width of lowres frame in downscale CUs - int m_heightInCU; // height of lowres frame in downscale CUs - int m_lastKeyframe; - int m_histogram[X265_BFRAME_MAX + 1];
x265_1.5.tar.gz/source/encoder/weightPrediction.cpp -> x265_1.6.tar.gz/source/encoder/weightPrediction.cpp
Changed
@@ -27,8 +27,8 @@ #include "frame.h" #include "picyuv.h" #include "lowres.h" +#include "slice.h" #include "mv.h" -#include "slicetype.h" #include "bitstream.h" using namespace x265; @@ -58,6 +58,7 @@ void mcLuma(pixel* mcout, Lowres& ref, const MV * mvs) { intptr_t stride = ref.lumaStride; + const int mvshift = 1 << 2; const int cuSize = 8; MV mvmin, mvmax; @@ -66,15 +67,15 @@ for (int y = 0; y < ref.lines; y += cuSize) { intptr_t pixoff = y * stride; - mvmin.y = (int16_t)((-y - 8) << 2); - mvmax.y = (int16_t)((ref.lines - y - 1 + 8) << 2); + mvmin.y = (int16_t)((-y - 8) * mvshift); + mvmax.y = (int16_t)((ref.lines - y - 1 + 8) * mvshift); for (int x = 0; x < ref.width; x += cuSize, pixoff += cuSize, cu++) { ALIGN_VAR_16(pixel, buf8x8[8 * 8]); intptr_t bstride = 8; - mvmin.x = (int16_t)((-x - 8) << 2); - mvmax.x = (int16_t)((ref.width - x - 1 + 8) << 2); + mvmin.x = (int16_t)((-x - 8) * mvshift); + mvmax.x = (int16_t)((ref.width - x - 1 + 8) * mvshift); /* clip MV to available pixels */ MV mv = mvs[cu]; @@ -100,6 +101,7 @@ int csp = cache.csp; int bw = 16 >> cache.hshift; int bh = 16 >> cache.vshift; + const int mvshift = 1 << 2; MV mvmin, mvmax; for (int y = 0; y < height; y += bh) @@ -109,8 +111,8 @@ * into the lowres structures */ int cu = y * cache.lowresWidthInCU; intptr_t pixoff = y * stride; - mvmin.y = (int16_t)((-y - 8) << 2); - mvmax.y = (int16_t)((height - y - 1 + 8) << 2); + mvmin.y = (int16_t)((-y - 8) * mvshift); + mvmax.y = (int16_t)((height - y - 1 + 8) * mvshift); for (int x = 0; x < width; x += bw, cu++, pixoff += bw) { @@ -122,8 +124,8 @@ mv.y >>= cache.vshift; /* clip MV to available pixels */ - mvmin.x = (int16_t)((-x - 8) << 2); - mvmax.x = (int16_t)((width - x - 1 + 8) << 2); + mvmin.x = (int16_t)((-x - 8) * mvshift); + mvmax.x = (int16_t)((width - x - 1 + 8) * mvshift); mv = mv.clipped(mvmin, mvmax); intptr_t fpeloffset = (mv.y >> 2) * stride + (mv.x >> 2);
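The weightPrediction.cpp hunks replace expressions such as (-y - 8) << 2 with (-y - 8) * mvshift, where mvshift is 1 << 2. The diff does not state the motivation; one plausible reason is that left-shifting a negative signed integer is undefined behavior in C++ before C++20, while multiplying by 4 is well defined. The sketch below reproduces the clipping bounds under that reading. QpelMV and clipToFrame are hypothetical names, not x265 identifiers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for x265's MV type: components in quarter-pel units.
struct QpelMV
{
    int16_t x, y;
};

// Clip a quarter-pel MV so the block at (x, y) stays within 8 pixels of
// padding around a width x lines plane, using the bounds from the hunk.
QpelMV clipToFrame(QpelMV mv, int x, int y, int width, int lines)
{
    assert(width > 0 && lines > 0);
    const int mvshift = 1 << 2; // despite the name, a multiplier: quarter-pel units per pixel

    const int minX = (-x - 8) * mvshift;            // may be negative, so multiply rather than shift
    const int maxX = (width - x - 1 + 8) * mvshift;
    const int minY = (-y - 8) * mvshift;
    const int maxY = (lines - y - 1 + 8) * mvshift;

    QpelMV out;
    out.x = (int16_t)std::min(std::max((int)mv.x, minX), maxX);
    out.y = (int16_t)std::min(std::max((int)mv.y, minY), maxY);
    return out;
}
```

For a block at the origin of a 64x32 plane the bounds are [-32, 284] horizontally and [-32, 156] vertically, so a vector of (-100, 500) is clipped to (-32, 156).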
x265_1.5.tar.gz/source/input/y4m.cpp -> x265_1.6.tar.gz/source/input/y4m.cpp
Changed
@@ -177,147 +177,118 @@ int csp = 0; int d = 0; - while (!ifs->eof()) + while (ifs->good()) { // Skip Y4MPEG string int c = ifs->get(); - while (!ifs->eof() && (c != ' ') && (c != '\n')) - { + while (ifs->good() && (c != ' ') && (c != '\n')) c = ifs->get(); - } - while (c == ' ' && !ifs->eof()) + while (c == ' ' && ifs->good()) { // read parameter identifier switch (ifs->get()) { case 'W': width = 0; - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); if (c == ' ' || c == '\n') - { break; - } else - { width = width * 10 + (c - '0'); - } } - break; case 'H': height = 0; - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); if (c == ' ' || c == '\n') - { break; - } else - { height = height * 10 + (c - '0'); - } } - break; case 'F': rateNum = 0; rateDenom = 0; - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); if (c == '.') { rateDenom = 1; - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); if (c == ' ' || c == '\n') - { break; - } else { rateNum = rateNum * 10 + (c - '0'); rateDenom = rateDenom * 10; } } - break; } else if (c == ':') { - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); if (c == ' ' || c == '\n') - { break; - } else rateDenom = rateDenom * 10 + (c - '0'); } - break; } else - { rateNum = rateNum * 10 + (c - '0'); - } } - break; case 'A': sarWidth = 0; sarHeight = 0; - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); if (c == ':') { - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); if (c == ' ' || c == '\n') - { break; - } else sarHeight = sarHeight * 10 + (c - '0'); } - break; } else - { sarWidth = sarWidth * 10 + (c - '0'); - } } - break; case 'C': csp = 0; d = 0; - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); if (c <= '9' && c >= '0') - { csp = csp * 10 + (c - '0'); - } else if (c == 'p') { // example: C420p16 - while (!ifs->eof()) + while (ifs->good()) { c = ifs->get(); @@ -338,22 +309,19 @@ break; default: - while (!ifs->eof()) + while (ifs->good()) { // 
consume this unsupported configuration word c = ifs->get(); if (c == ' ' || c == '\n') break; } - break; } } if (c == '\n') - { break; - } } if (width < MIN_FRAME_WIDTH || width > MAX_FRAME_WIDTH ||
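Every loop condition in this header parser changes from !ifs->eof() to ifs->good(). The two differ when a stream fails without reaching end of file: eof() tests only eofbit, whereas good() is false as soon as eofbit, failbit, or badbit is set, so a loop guarded by good() cannot spin on a stream that is in an error state. The sketch below parses one decimal field in the same style. readDecimal is a hypothetical helper, and the explicit end-of-file check on the value returned by get() is an addition that the hunk itself does not contain.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Read a decimal field terminated by ' ' or '\n', in the style of the
// 'W' and 'H' cases above. Returns the terminating character, or EOF
// if the stream ended first. The parsed number is stored in value.
int readDecimal(std::istream& in, int& value)
{
    const int eofValue = std::char_traits<char>::eof();
    int c = eofValue;
    value = 0;
    while (in.good()) // false on eofbit, failbit, or badbit
    {
        c = in.get();
        if (c == ' ' || c == '\n' || c == eofValue)
            break;
        value = value * 10 + (c - '0');
    }
    return c;
}
```

Called on a stream holding "1920 H1080\n", the first call yields 1920 and returns ' ', which mirrors how the outer loop in the hunk decides whether another parameter follows.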
x265_1.5.tar.gz/source/output/y4m.cpp -> x265_1.6.tar.gz/source/output/y4m.cpp
Changed
@@ -46,9 +46,7 @@ } for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++) - { frameSize += (uint32_t)((width >> x265_cli_csps[colorSpace].width[i]) * (height >> x265_cli_csps[colorSpace].height[i])); - } } Y4MOutput::~Y4MOutput() @@ -66,14 +64,10 @@ #if HIGH_BIT_DEPTH if (pic.bitDepth > 8 && pic.poc == 0) - { x265_log(NULL, X265_LOG_WARNING, "y4m: down-shifting reconstructed pixels to 8 bits\n"); - } #else if (pic.bitDepth > 8 && pic.poc == 0) - { x265_log(NULL, X265_LOG_WARNING, "y4m: forcing reconstructed pixels to 8 bits\n"); - } #endif X265_CHECK(pic.colorSpace == colorSpace, "invalid color space\n"); @@ -89,9 +83,7 @@ for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++) { for (int w = 0; w < width >> x265_cli_csps[colorSpace].width[i]; w++) - { buf[w] = (char)(src[w] >> shift); - } ofs.write(buf, width >> x265_cli_csps[colorSpace].width[i]); src += pic.stride[i] / sizeof(*src);
x265_1.5.tar.gz/source/output/yuv.cpp -> x265_1.6.tar.gz/source/output/yuv.cpp
Changed
@@ -39,9 +39,7 @@ buf = new char[width]; for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++) - { frameSize += (uint32_t)((width >> x265_cli_csps[colorSpace].width[i]) * (height >> x265_cli_csps[colorSpace].height[i])); - } } YUVOutput::~YUVOutput() @@ -69,9 +67,7 @@ for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++) { for (int w = 0; w < width >> x265_cli_csps[colorSpace].width[i]; w++) - { buf[w] = (char)(src[w] >> shift); - } ofs.write(buf, width >> x265_cli_csps[colorSpace].width[i]); src += pic.stride[i] / sizeof(*src);
x265_1.5.tar.gz/source/profile/cpuEvents.h -> x265_1.6.tar.gz/source/profile/cpuEvents.h
Changed
@@ -5,6 +5,7 @@
 CPU_EVENT(filterCTURow)
 CPU_EVENT(slicetypeDecideEV)
 CPU_EVENT(prelookahead)
-CPU_EVENT(costEstimateRow)
+CPU_EVENT(estCostSingle)
+CPU_EVENT(estCostCoop)
 CPU_EVENT(pmode)
 CPU_EVENT(pme)
x265_1.5.tar.gz/source/test/CMakeLists.txt -> x265_1.6.tar.gz/source/test/CMakeLists.txt
Changed
@@ -23,3 +23,6 @@
     ipfilterharness.cpp ipfilterharness.h
     intrapredharness.cpp intrapredharness.h)
 target_link_libraries(TestBench x265-static ${PLATFORM_LIBS})
+if(LINKER_OPTIONS)
+    set_target_properties(TestBench PROPERTIES LINK_FLAGS ${LINKER_OPTIONS})
+endif()
x265_1.5.tar.gz/source/test/ipfilterharness.cpp -> x265_1.6.tar.gz/source/test/ipfilterharness.cpp
Changed
@@ -61,7 +61,7 @@ } } -bool IPFilterHarness::check_IPFilter_primitive(filter_p2s_t ref, filter_p2s_t opt, int isChroma, int csp) +bool IPFilterHarness::check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp) { intptr_t rand_srcStride; int min_size = isChroma ? 2 : 4; @@ -512,6 +512,46 @@ return true; } +bool IPFilterHarness::check_IPFilterLumaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt) +{ + for (int i = 0; i < ITERS; i++) + { + intptr_t rand_srcStride = rand() % 100; + int index = i % TEST_CASES; + + ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s); + + checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s); + + if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel))) + return false; + + reportfail(); + } + + return true; +} + +bool IPFilterHarness::check_IPFilterChromaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt) +{ + for (int i = 0; i < ITERS; i++) + { + intptr_t rand_srcStride = rand() % 100; + int index = i % TEST_CASES; + + ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s); + + checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s); + + if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel))) + return false; + + reportfail(); + } + + return true; +} + bool IPFilterHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt) { if (opt.luma_p2s) @@ -582,6 +622,14 @@ return false; } } + if (opt.pu[value].filter_p2s) + { + if (!check_IPFilterLumaP2S_primitive(ref.pu[value].filter_p2s, opt.pu[value].filter_p2s)) + { + printf("filter_p2s[%s]", lumaPartStr[value]); + return false; + } + } } for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++) @@ -644,6 +692,14 @@ return false; } } + if (opt.chroma[csp].pu[value].chroma_p2s) + { + if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].chroma_p2s, opt.chroma[csp].pu[value].chroma_p2s)) + { + printf("chroma_p2s[%s]", 
chromaPartStr[csp][value]); + return false; + } + } } } @@ -720,6 +776,13 @@ REPORT_SPEEDUP(opt.pu[value].luma_hvpp, ref.pu[value].luma_hvpp, pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3); } + + if (opt.pu[value].filter_p2s) + { + printf("filter_p2s [%s]\t", lumaPartStr[value]); + REPORT_SPEEDUP(opt.pu[value].filter_p2s, ref.pu[value].filter_p2s, + pixel_buff, srcStride, IPF_vec_output_s); + } } for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++) @@ -773,6 +836,14 @@ short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, IPF_vec_output_s, dstStride, 1); } + + if (opt.chroma[csp].pu[value].chroma_p2s) + { + printf("chroma_p2s[%s]\t", chromaPartStr[csp][value]); + REPORT_SPEEDUP(opt.chroma[csp].pu[value].chroma_p2s, ref.chroma[csp].pu[value].chroma_p2s, + pixel_buff, srcStride, + IPF_vec_output_s); + } } } }
x265_1.5.tar.gz/source/test/ipfilterharness.h -> x265_1.6.tar.gz/source/test/ipfilterharness.h
Changed
@@ -50,7 +50,7 @@ pixel pixel_test_buff[TEST_CASES][TEST_BUF_SIZE]; int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE]; - bool check_IPFilter_primitive(filter_p2s_t ref, filter_p2s_t opt, int isChroma, int csp); + bool check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp); bool check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt); bool check_IPFilterChroma_ps_primitive(filter_ps_t ref, filter_ps_t opt); bool check_IPFilterChroma_hps_primitive(filter_hps_t ref, filter_hps_t opt); @@ -62,6 +62,8 @@ bool check_IPFilterLuma_sp_primitive(filter_sp_t ref, filter_sp_t opt); bool check_IPFilterLuma_ss_primitive(filter_ss_t ref, filter_ss_t opt); bool check_IPFilterLumaHV_primitive(filter_hv_pp_t ref, filter_hv_pp_t opt); + bool check_IPFilterLumaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt); + bool check_IPFilterChromaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt); public:
x265_1.5.tar.gz/source/test/mbdstharness.cpp -> x265_1.6.tar.gz/source/test/mbdstharness.cpp
Changed
@@ -209,7 +209,7 @@ for (int i = 0; i < ITERS; i++) { - int width = (rand() % 4 + 1) * 4; + int width = 1 << (rand() % 4 + 2); int height = width; uint32_t optReturnValue = 0; @@ -278,42 +278,19 @@ return true; } - bool MBDstHarness::check_count_nonzero_primitive(count_nonzero_t ref, count_nonzero_t opt) { - ALIGN_VAR_32(int16_t, qcoeff[32 * 32]); - - for (int i = 0; i < 4; i++) + int j = 0; + for (int i = 0; i < ITERS; i++) { - int log2TrSize = i + 2; - int num = 1 << (log2TrSize * 2); - int mask = num - 1; - - for (int n = 0; n <= num; n++) - { - memset(qcoeff, 0, num * sizeof(int16_t)); - - for (int j = 0; j < n; j++) - { - int k = rand() & mask; - while (qcoeff[k]) - { - k = (k + 11) & mask; - } - - qcoeff[k] = (int16_t)rand() - RAND_MAX / 2; - } - - int refval = ref(qcoeff, num); - int optval = (int)checked(opt, qcoeff, num); - - if (refval != optval) - return false; - - reportfail(); - } + int index = i % TEST_CASES; + int opt_cnt = (int)checked(opt, short_test_buff[index] + j); + int ref_cnt = ref(short_test_buff[index] + j); + if (ref_cnt != opt_cnt) + return false; + reportfail(); + j += INCR; } - return true; } @@ -437,16 +414,17 @@ return false; } } - - if (opt.count_nonzero) + for (int i = 0; i < NUM_TR_SIZE; i++) { - if (!check_count_nonzero_primitive(ref.count_nonzero, opt.count_nonzero)) + if (opt.cu[i].count_nonzero) { - printf("count_nonzero: Failed!\n"); - return false; + if (!check_count_nonzero_primitive(ref.cu[i].count_nonzero, opt.cu[i].count_nonzero)) + { + printf("count_nonzero[%dx%d] Failed!\n", 4 << i, 4 << i); + return false; + } } } - if (opt.dequant_scaling) { if (!check_dequant_primitive(ref.dequant_scaling, opt.dequant_scaling)) @@ -523,16 +501,14 @@ printf("nquant\t\t"); REPORT_SPEEDUP(opt.nquant, ref.nquant, short_test_buff[0], int_test_buff[1], mshortbuf2, 23, 23785, 32 * 32); } - - if (opt.count_nonzero) + for (int value = 0; value < NUM_TR_SIZE; value++) { - for (int i = 4; i <= 32; i <<= 1) + if (opt.cu[value].count_nonzero) { - 
printf("count_nonzero[%dx%d]", i, i); - REPORT_SPEEDUP(opt.count_nonzero, ref.count_nonzero, mbuf1, i * i) + printf("count_nonzero[%dx%d]", 4 << value, 4 << value); + REPORT_SPEEDUP(opt.cu[value].count_nonzero, ref.cu[value].count_nonzero, mbuf1); } } - if (opt.denoiseDct) { printf("denoiseDct\t");
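The first hunk in this file changes how a random test width is drawn: (rand() % 4 + 1) * 4 produces 4, 8, 12, or 16, while 1 << (rand() % 4 + 2) produces 4, 8, 16, or 32, which are the square transform sizes that the per-size count_nonzero tests label with 4 << i. The helpers below isolate the two formulas so they can be compared for a given random draw; oldWidth and newWidth are illustrative names only.

```cpp
#include <cassert>

// Width formula before the change: multiples of 4 up to 16.
int oldWidth(int r)
{
    assert(r >= 0);
    return (r % 4 + 1) * 4;
}

// Width formula after the change: powers of two from 4 to 32.
int newWidth(int r)
{
    assert(r >= 0);
    return 1 << (r % 4 + 2);
}
```

The draws 0 and 1 give the same width under both formulas; the draws 2 and 3 move from 12 and 16 to 16 and 32, so the size 12 is no longer generated and 32 is now covered.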
x265_1.5.tar.gz/source/test/pixelharness.cpp -> x265_1.6.tar.gz/source/test/pixelharness.cpp
Changed
@@ -1149,6 +1149,71 @@ return true; } +bool PixelHarness::check_findPosLast(findPosLast_t ref, findPosLast_t opt) +{ + ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]); + uint8_t ref_coeffNum[MLS_GRP_NUM], opt_coeffNum[MLS_GRP_NUM]; // value range[0, 16] + uint16_t ref_coeffSign[MLS_GRP_NUM], opt_coeffSign[MLS_GRP_NUM]; // bit mask map for non-zero coeff sign + uint16_t ref_coeffFlag[MLS_GRP_NUM], opt_coeffFlag[MLS_GRP_NUM]; // bit mask map for non-zero coeff + + int totalCoeffs = 0; + for (int i = 0; i < 32 * 32; i++) + { + ref_src[i] = rand() & SHORT_MAX; + totalCoeffs += (ref_src[i] != 0); + } + + // extra test area all of 0x1234 + for (int i = 0; i < ITERS * 2; i++) + { + ref_src[32 * 32 + i] = 0x1234; + } + + + memset(ref_coeffNum, 0xCD, sizeof(ref_coeffNum)); + memset(ref_coeffSign, 0xCD, sizeof(ref_coeffSign)); + memset(ref_coeffFlag, 0xCD, sizeof(ref_coeffFlag)); + + memset(opt_coeffNum, 0xCD, sizeof(opt_coeffNum)); + memset(opt_coeffSign, 0xCD, sizeof(opt_coeffSign)); + memset(opt_coeffFlag, 0xCD, sizeof(opt_coeffFlag)); + + for (int i = 0; i < ITERS; i++) + { + int rand_scan_type = rand() % NUM_SCAN_TYPE; + int rand_scan_size = rand() % NUM_SCAN_SIZE; + int rand_numCoeff = 0; + + for (int j = 0; j < 1 << (2 * (rand_scan_size + 2)); j++) + rand_numCoeff += (ref_src[i + j] != 0); + + const uint16_t* const scanTbl = g_scanOrder[rand_scan_type][rand_scan_size]; + + int ref_scanPos = ref(scanTbl, ref_src + i, ref_coeffSign, ref_coeffFlag, ref_coeffNum, rand_numCoeff); + int opt_scanPos = (int)checked(opt, scanTbl, ref_src + i, opt_coeffSign, opt_coeffFlag, opt_coeffNum, rand_numCoeff); + + if (ref_scanPos != opt_scanPos) + return false; + + for (int j = 0; rand_numCoeff; j++) + { + if (ref_coeffSign[j] != opt_coeffSign[j]) + return false; + + if (ref_coeffFlag[j] != opt_coeffFlag[j]) + return false; + + if (ref_coeffNum[j] != opt_coeffNum[j]) + return false; + + rand_numCoeff -= ref_coeffNum[j]; + } + + reportfail(); + } + + return true; +} bool 
PixelHarness::testPU(int part, const EncoderPrimitives& ref, const EncoderPrimitives& opt) { @@ -1299,6 +1364,14 @@ return false; } } + if (opt.chroma[i].pu[part].satd) + { + if (!check_pixelcmp(ref.chroma[i].pu[part].satd, opt.chroma[i].pu[part].satd)) + { + printf("chroma_satd[%s][%s] failed!\n", x265_source_csp_names[i], chromaPartStr[i][part]); + return false; + } + } if (part < NUM_CU_SIZES) { if (opt.chroma[i].cu[part].sub_ps) @@ -1467,7 +1540,7 @@ { if (!check_cpy2Dto1D_shl_t(ref.cu[i].cpy2Dto1D_shl, opt.cu[i].cpy2Dto1D_shl)) { - printf("cpy2Dto1D_shl failed!\n"); + printf("cpy2Dto1D_shl[%dx%d] failed!\n", 4 << i, 4 << i); return false; } } @@ -1645,6 +1718,15 @@ } } + if (opt.findPosLast) + { + if (!check_findPosLast(ref.findPosLast, opt.findPosLast)) + { + printf("findPosLast failed!\n"); + return false; + } + } + return true; } @@ -1688,7 +1770,7 @@ if (opt.pu[part].copy_pp) { HEADER("copy_pp[%s]", lumaPartStr[part]); - REPORT_SPEEDUP(opt.pu[part].copy_pp, ref.pu[part].copy_pp, pbuf1, 64, pbuf2, 128); + REPORT_SPEEDUP(opt.pu[part].copy_pp, ref.pu[part].copy_pp, pbuf1, 64, pbuf2, 64); } if (opt.pu[part].addAvg) @@ -1723,7 +1805,7 @@ if (opt.cu[part].copy_ss) { HEADER("copy_ss[%s]", lumaPartStr[part]); - REPORT_SPEEDUP(opt.cu[part].copy_ss, ref.cu[part].copy_ss, sbuf1, 64, sbuf2, 128); + REPORT_SPEEDUP(opt.cu[part].copy_ss, ref.cu[part].copy_ss, sbuf1, 128, sbuf2, 128); } if (opt.cu[part].copy_sp) { @@ -1733,7 +1815,7 @@ if (opt.cu[part].copy_ps) { HEADER("copy_ps[%s]", lumaPartStr[part]); - REPORT_SPEEDUP(opt.cu[part].copy_ps, ref.cu[part].copy_ps, sbuf1, 64, pbuf1, 128); + REPORT_SPEEDUP(opt.cu[part].copy_ps, ref.cu[part].copy_ps, sbuf1, 128, pbuf1, 64); } } @@ -1749,6 +1831,11 @@ HEADER("[%s] addAvg[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); REPORT_SPEEDUP(opt.chroma[i].pu[part].addAvg, ref.chroma[i].pu[part].addAvg, sbuf1, sbuf2, pbuf1, STRIDE, STRIDE, STRIDE); } + if (opt.chroma[i].pu[part].satd) + { + HEADER("[%s] satd[%s]", 
x265_source_csp_names[i], chromaPartStr[i][part]); + REPORT_SPEEDUP(opt.chroma[i].pu[part].satd, ref.chroma[i].pu[part].satd, pbuf1, STRIDE, fref, STRIDE); + } if (part < NUM_CU_SIZES) { if (opt.chroma[i].cu[part].copy_ss) @@ -1990,4 +2077,13 @@ HEADER0("propagateCost"); REPORT_SPEEDUP(opt.propagateCost, ref.propagateCost, ibuf1, ushort_test_buff[0], int_test_buff[0], ushort_test_buff[0], int_test_buff[0], double_test_buff[0], 80); } + + if (opt.findPosLast) + { + HEADER0("findPosLast"); + coeff_t coefBuf[32 * 32]; + memset(coefBuf, 0, sizeof(coefBuf)); + memset(coefBuf + 32 * 31, 1, 32 * sizeof(coeff_t)); + REPORT_SPEEDUP(opt.findPosLast, ref.findPosLast, g_scanOrder[SCAN_DIAG][NUM_SCAN_SIZE - 1], coefBuf, (uint16_t*)sbuf1, (uint16_t*)sbuf2, (uint8_t*)psbuf1, 32); + } }
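The findPosLast benchmark at the end of this hunk prepares its input with memset(coefBuf + 32 * 31, 1, 32 * sizeof(coeff_t)). Because memset writes bytes, every coefficient in that last row ends up with each of its bytes equal to 1 rather than with the value 1. The sketch below shows the effect under the assumption that coeff_t is a 16-bit integer, which this diff does not confirm; for the benchmark only the fact that the row is non-zero matters.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption for illustration only: a 16-bit coefficient type.
typedef int16_t coeff16_t;

// Fill one row of 32 coefficients the way the benchmark does and
// return the value each element ends up holding.
coeff16_t byteFilledCoefficient()
{
    coeff16_t row[32];
    std::memset(row, 1, sizeof(row)); // sets every byte to 0x01
    return row[0];                    // 0x0101, regardless of endianness
}
```

With a 16-bit type each element reads back as 0x0101 (257), so the row is non-zero as the benchmark needs, just not filled with ones.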
x265_1.5.tar.gz/source/test/pixelharness.h -> x265_1.6.tar.gz/source/test/pixelharness.h
Changed
@@ -104,6 +104,7 @@ bool check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt); bool check_psyCost_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt); bool check_calSign(sign_t ref, sign_t opt); + bool check_findPosLast(findPosLast_t ref, findPosLast_t opt); public:
x265_1.6.tar.gz/source/test/rate-control-tests.txt
Added
@@ -0,0 +1,34 @@
+# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
+
+# This test is listed first since it currently reproduces bugs
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
+
+# VBV tests, non-deterministic so testing for correctness and bitrate
+# fluctuations - up to 1% bitrate fluctuation is allowed between runs
+RaceHorses_416x240_30_10bit.yuv,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700
+RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --vbv-bufsize 600 --vbv-maxrate 600
+RaceHorses_416x240_30_10bit.yuv,--preset veryslow --bitrate 1100 --vbv-bufsize 1100 --vbv-maxrate 1200
+112_1920x1080_25.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud
+112_1920x1080_25.yuv,--preset medium --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd
+112_1920x1080_25.yuv,--preset medium --bitrate 4000 --vbv-maxrate 12000 --vbv-bufsize 12000 --repeat-headers
+112_1920x1080_25.yuv,--preset superfast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1500 --hrd --strict-cbr
+112_1920x1080_25.yuv,--preset superfast --bitrate 30000 --vbv-maxrate 30000 --vbv-bufsize 30000 --repeat-headers
+112_1920x1080_25.yuv,--preset superfast --bitrate 4000 --vbv-maxrate 6000 --vbv-bufsize 6000 --aud
+112_1920x1080_25.yuv,--preset veryslow --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --repeat-headers
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --vbv-bufsize 3000 --vbv-maxrate 3000 --repeat-headers
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud
+big_buck_bunny_360p24.y4m,--preset medium --crf 1 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
+big_buck_bunny_360p24.y4m,--preset superfast --bitrate 1000 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud --strict-cbr
+big_buck_bunny_360p24.y4m,--preset superfast --bitrate 3000 --vbv-bufsize 9000 --vbv-maxrate 9000 --repeat-headers
+big_buck_bunny_360p24.y4m,--preset superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd
+big_buck_bunny_360p24.y4m,--preset superfast --crf 6 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud
+
+# multi-pass rate control tests
+big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1,--preset slow --bitrate 200 --pass 2
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4
+112_1920x1080_25.yuv,--preset slow --bitrate 1000 --pass 1 -F4,--preset slow --bitrate 1000 --pass 2 -F4
+112_1920x1080_25.yuv,--preset superfast --crf 12 --pass 1,--preset superfast --bitrate 4000 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset veryslow --crf 40 --pass 1, --preset veryslow --bitrate 200 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 600 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --pass 1,--preset medium --bitrate 500 --pass 3 -F4,--preset medium --bitrate 500 --pass 2 -F4
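Each entry in this list appears to be an input sequence name followed by one comma-separated command line per encoder pass, with `#` starting a comment. The sketch below splits a line on that assumption and adds a check for the 1% bitrate tolerance mentioned in the VBV comment. `TestEntry`, `parseTestLine` and `bitrateWithinTolerance` are hypothetical names, not part of the test harness, and measuring the tolerance relative to a reference run is also an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical container for one entry of a test list file.
struct TestEntry
{
    std::string sequence;            // input file, e.g. "big_buck_bunny_360p24.y4m"
    std::vector<std::string> passes; // one command line per encoder pass
};

// Strip leading and trailing spaces and tabs.
static std::string trim(const std::string& s)
{
    const size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos)
        return std::string();
    const size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Returns false for blank lines and '#' comments, true for a parsed entry.
bool parseTestLine(const std::string& line, TestEntry& out)
{
    const std::string t = trim(line);
    if (t.empty() || t[0] == '#')
        return false;

    out.sequence.clear();
    out.passes.clear();

    size_t start = 0;
    bool first = true;
    while (start <= t.size())
    {
        size_t comma = t.find(',', start);
        if (comma == std::string::npos)
            comma = t.size();
        const std::string field = trim(t.substr(start, comma - start));
        if (first)
        {
            out.sequence = field; // first field is the input sequence
            first = false;
        }
        else
            out.passes.push_back(field); // remaining fields are per-pass options
        start = comma + 1;
    }
    return !out.sequence.empty() && !out.passes.empty();
}

// True when `measured` differs from `reference` by at most `tolerance`
// (a fraction of the reference, 1% by default).
bool bitrateWithinTolerance(double reference, double measured, double tolerance = 0.01)
{
    if (reference <= 0)
        return false;
    const double diff = measured > reference ? measured - reference : reference - measured;
    return diff <= tolerance * reference;
}
```

Fields are trimmed because one of the multi-pass entries has a space after the comma.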
x265_1.6.tar.gz/source/test/regression-tests.txt
Added
@@ -0,0 +1,127 @@
+# List of command lines to be run by regression tests, see https://bitbucket.org/sborho/test-harness
+
+# the vast majority of the commands are tested for results matching the
+# most recent commit which was known to change outputs. The output
+# bitstream must be bit-exact or the test fails. If no golden outputs
+# are available the bitstream is validated (decoded) and then saved as a
+# new golden output
+
+# Note: --nr-intra, --nr-inter, and --bitrate (ABR) give different
+# outputs for different frame encoder counts. In order for outputs to be
+# consistent across many machines, you must force a certain -FN so it is
+# not auto-detected.
+
+BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7
+BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
+BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3
+BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
+BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp
+BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
+BasketballDrive_1920x1080_50.y4m,--preset veryfast --tune zerolatency --no-temporal-mvp
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode
+Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh
+Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1
+Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --aq-mode 0 --sar 2 --range full
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --max-tu-size 4 --min-cu-size 32
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset medium --no-wpp --no-cutree --no-strong-intra-smoothing
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset slow --no-wpp --tune ssim --transfer smpte240m
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset ultrafast --weightp --no-wpp --no-open-gop
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
+FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd
+FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2
+Keiba_832x480_30.y4m,--preset medium --pmode --tune grain
+Keiba_832x480_30.y4m,--preset slower --fast-intra --nr-inter 500 -F4
+Keiba_832x480_30.y4m,--preset superfast --no-fast-intra --nr-intra 1000 -F4
+Kimono1_1920x1080_24_10bit_444.yuv,--preset medium --min-cu-size 32
+Kimono1_1920x1080_24_10bit_444.yuv,--preset superfast --weightb
+KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
+KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8
+KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16
+KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
+News-4k.y4m,--preset medium --tune ssim --no-sao
+News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
+OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
+OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
+OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp
+ParkScene_1920x1080_24.y4m,--preset medium --qp 40 --rdpenalty 2 --tu-intra-depth 3
+ParkScene_1920x1080_24.y4m,--preset slower --no-weightp
+ParkScene_1920x1080_24_10bit_444.yuv,--preset superfast --weightp --lookahead-slices 4
+RaceHorses_416x240_30.y4m,--preset medium --tskip-fast --tskip
+RaceHorses_416x240_30.y4m,--preset slower --keyint -1 --rdoq-level 0
+RaceHorses_416x240_30.y4m,--preset superfast --no-cutree
+RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip
+RaceHorses_416x240_30_10bit.yuv,--preset fast --lookahead-slices 2 --b-intra
+RaceHorses_416x240_30_10bit.yuv,--preset faster --rdoq-level 0 --dither
+RaceHorses_416x240_30_10bit.yuv,--preset slow --tune grain
+RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --tune psnr
+RaceHorses_416x240_30_10bit.yuv,--preset veryfast --weightb
+RaceHorses_416x240_30_10bit.yuv,--preset placebo
+SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv,--preset medium --dither
+big_buck_bunny_360p24.y4m,--preset faster --keyint 240 --min-keyint 60 --rc-lookahead 200
+big_buck_bunny_360p24.y4m,--preset medium --keyint 60 --min-keyint 48 --weightb
+big_buck_bunny_360p24.y4m,--preset slow --psy-rdoq 2.0 --rdoq-level 1 --no-b-intra
+big_buck_bunny_360p24.y4m,--preset superfast --psy-rdoq 2.0
+big_buck_bunny_360p24.y4m,--preset ultrafast --deblock=2
+big_buck_bunny_360p24.y4m,--preset veryfast --no-deblock
+city_4cif_60fps.y4m,--preset medium --crf 4 --cu-lossless --sao-non-deblock
+city_4cif_60fps.y4m,--preset superfast --rdpenalty 1 --tu-intra-depth 2
+city_4cif_60fps.y4m,--preset slower --scaling-list default
+city_4cif_60fps.y4m,--preset veryslow --rdpenalty 2 --sao-non-deblock --no-b-intra
+ducks_take_off_420_720p50.y4m,--preset fast --deblock 6 --bframes 16 --rc-lookahead 40
+ducks_take_off_420_720p50.y4m,--preset faster --qp 24 --deblock -6
+ducks_take_off_420_720p50.y4m,--preset medium --tskip --tskip-fast --constrained-intra
+ducks_take_off_420_720p50.y4m,--preset slow --scaling-list default --qp 40
+ducks_take_off_420_720p50.y4m,--preset ultrafast --constrained-intra --rd 1
+ducks_take_off_420_720p50.y4m,--preset veryslow --constrained-intra --bframes 2
+ducks_take_off_444_720p50.y4m,--preset medium --qp 38 --no-scenecut
+ducks_take_off_444_720p50.y4m,--preset superfast --weightp --rd 0
+ducks_take_off_444_720p50.y4m,--preset slower --psy-rd 1 --psy-rdoq 2.0 --rdoq-level 1
+mobile_calendar_422_ntsc.y4m,--preset medium --bitrate 500 -F4
+mobile_calendar_422_ntsc.y4m,--preset slower --tskip --tskip-fast
+mobile_calendar_422_ntsc.y4m,--preset superfast --weightp --rd 0
+mobile_calendar_422_ntsc.y4m,--preset veryslow --tskip
+old_town_cross_444_720p50.y4m,--preset faster --rd 1 --tune zero-latency
+old_town_cross_444_720p50.y4m,--preset medium --keyint -1 --no-weightp --ref 6
+old_town_cross_444_720p50.y4m,--preset slow --rdoq-level 1 --early-skip --ref 7 --no-b-pyramid
+old_town_cross_444_720p50.y4m,--preset slower --crf 4 --cu-lossless
+old_town_cross_444_720p50.y4m,--preset superfast --weightp --min-cu 16
+old_town_cross_444_720p50.y4m,--preset ultrafast --weightp --min-cu 32
+old_town_cross_444_720p50.y4m,--preset veryfast --qp 1 --tune ssim
+parkrun_ter_720p50.y4m,--preset medium --no-open-gop --sao-non-deblock --crf 4 --cu-lossless
+parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain
+silent_cif_420.y4m,--preset medium --me full --rect --amp
+silent_cif_420.y4m,--preset superfast --weightp --rect
+silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao
+vtc1nw_422_ntsc.y4m,--preset medium --scaling-list default --ctu 16 --ref 5
+vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode
+vtc1nw_422_ntsc.y4m,--preset superfast --weightp --nr-intra 100 -F4
+washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5
+washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4
+washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2
+washdc_422_ntsc.y4m,--preset superfast --psy-rd 1 --tune zerolatency
+washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4
+washdc_422_ntsc.y4m,--preset veryfast --tu-inter-depth 4
+washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless
+
+# interlace test, even though input YUV is not field seperated
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --interlace bff
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --interlace tff
+
+# vim: tw=200
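The note at the top of this list says that `--nr-intra`, `--nr-inter` and `--bitrate` only give reproducible output when the frame encoder count is forced with `-FN`. A minimal sketch of a check for that rule follows; `needsFixedFrameThreads` is a hypothetical helper and is not taken from the test harness. It treats `-F<N>` and `--frame-threads` as the pinning options, as listed in the CLI help.

```cpp
#include <sstream>
#include <string>

// Returns true when a command line uses an option whose output depends on the
// frame encoder count (--nr-intra, --nr-inter, --bitrate) but does not pin
// that count with -F<N>, -F <N> or --frame-threads.
bool needsFixedFrameThreads(const std::string& cmdline)
{
    std::istringstream in(cmdline);
    std::string tok;
    bool sensitive = false;
    bool pinned = false;
    while (in >> tok)
    {
        // Accept both "--opt value" and "--opt=value" spellings.
        const std::string name = tok.substr(0, tok.find('='));
        if (name == "--nr-intra" || name == "--nr-inter" || name == "--bitrate")
            sensitive = true;
        else if (name == "--frame-threads" ||
                 (tok.size() >= 2 && tok[0] == '-' && tok[1] == 'F'))
            pinned = true;
    }
    return sensitive && !pinned;
}
```

Run over the entries above, such a check should flag nothing, since every `--nr-*` and `--bitrate` entry in this file carries `-F4`.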
x265_1.6.tar.gz/source/test/smoke-tests.txt
Added
@@ -0,0 +1,17 @@
+# List of command lines to be run by smoke tests, see https://bitbucket.org/sborho/test-harness
+
+big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
+big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme
+washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1
+washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
+washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
+old_town_cross_444_720p50.y4m,--preset=ultrafast --weightp --keyint -1
+old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16
+old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode
+RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --cu-stats --max-tu-size 8
+RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=fast --weightb --interlace bff
x265_1.5.tar.gz/source/test/testbench.cpp -> x265_1.6.tar.gz/source/test/testbench.cpp
Changed
@@ -174,7 +174,10 @@
     for (int i = 0; test_arch[i].flag; i++)
     {
         if (test_arch[i].flag & cpuid)
+        {
             printf("Testing primitives: %s\n", test_arch[i].name);
+            fflush(stdout);
+        }
         else
             continue;
 
@@ -188,6 +191,7 @@
                 continue;
             if (!harness[h]->testCorrectness(cprim, vecprim))
             {
+                fflush(stdout);
                 fprintf(stderr, "\nx265: intrinsic primitive has failed. Go and fix that Right Now!\n");
                 return -1;
             }
@@ -204,6 +208,7 @@
                 continue;
             if (!harness[h]->testCorrectness(cprim, asmprim))
             {
+                fflush(stdout);
                 fprintf(stderr, "\nx265: asm primitive has failed. Go and fix that Right Now!\n");
                 return -1;
             }
@@ -226,6 +231,7 @@
     memcpy(&primitives, &optprim, sizeof(EncoderPrimitives));
 
     printf("\nTest performance improvement with full optimizations\n");
+    fflush(stdout);
 
     for (size_t h = 0; h < sizeof(harness) / sizeof(TestHarness*); h++)
     {
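The hunks above all add `fflush(stdout)` just before a message is written to `stderr`. The likely reason, which the diff itself does not state, is that `stdout` is buffered when redirected while `stderr` is not, so progress text could otherwise appear after the failure message in a combined log. A minimal sketch of the pattern, with `reportFailure` as a hypothetical helper:

```cpp
#include <cstdio>

// Hypothetical helper mirroring the pattern in the hunks above: drain the
// buffered stdout stream first, then write the failure message to stderr.
int reportFailure(const char* what)
{
    std::fflush(stdout);
    std::fprintf(stderr, "\nx265: %s primitive has failed.\n", what);
    return -1; // same failure code the testbench returns
}
```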
x265_1.5.tar.gz/source/test/testharness.h -> x265_1.6.tar.gz/source/test/testharness.h
Changed
@@ -158,7 +158,7 @@
     m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, \
     m_rand, m_rand, m_rand, m_rand, m_rand), /* max_args+6 */ \
     x265_checkasm_call_float((float(*)())func, &m_ok, 0, 0, 0, 0, __VA_ARGS__))
-#define reportfail() if (!m_ok) { fprintf(stderr, "stack clobber check failed at %s:%d", __FILE__, __LINE__); abort(); }
+#define reportfail() if (!m_ok) { fflush(stdout); fprintf(stderr, "stack clobber check failed at %s:%d", __FILE__, __LINE__); abort(); }
 #elif ARCH_X86
 #define checked(func, ...) x265_checkasm_call((intptr_t(*)())func, &m_ok, __VA_ARGS__);
 #define checked_float(func, ...) x265_checkasm_call_float((float(*)())func, &m_ok, __VA_ARGS__);
x265_1.5.tar.gz/source/x265.cpp -> x265_1.6.tar.gz/source/x265.cpp
Changed
@@ -147,6 +147,7 @@
 
     if (!bProgress || !frameNum || (prevUpdateTime && time - prevUpdateTime < UPDATE_INTERVAL))
         return;
+
     int64_t elapsed = time - startTime;
     double fps = elapsed > 0 ? frameNum * 1000000. / elapsed : 0;
     float bitrate = 0.008f * totalbytes * (param->fpsNum / param->fpsDenom) / ((float)frameNum);
@@ -158,9 +159,8 @@
                 eta / 3600, (eta / 60) % 60, eta % 60);
     }
     else
-    {
         sprintf(buf, "x265 %d frames: %.2f fps, %.2f kb/s", frameNum, fps, bitrate);
-    }
+
     fprintf(stderr, "%s \r", buf + 5);
     SetConsoleTitle(buf);
     fflush(stderr); // needed in windows
@@ -530,7 +530,7 @@
     while (pic_in && !b_ctrl_c)
     {
         pic_orig.poc = inFrameCount;
-        if (cliopt.qpfile && !param->rc.bStatRead)
+        if (cliopt.qpfile)
         {
             if (!cliopt.parseQPFile(pic_orig))
             {
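The context lines of the first hunk compute the progress figures: frames per second from an elapsed time in microseconds, and an average bitrate in kb/s from the bytes written so far. The sketch below restates that arithmetic as two standalone functions. `progressFps` and `progressBitrateKbps` are hypothetical names, and the frame rate is divided in floating point here, which differs from the original expression if `fpsNum` and `fpsDenom` are integer fields (30000/1001 would then truncate to 29).

```cpp
#include <cstdint>

// Frames per second from a frame count and an elapsed time in microseconds.
double progressFps(int frameNum, int64_t elapsedMicros)
{
    return elapsedMicros > 0 ? frameNum * 1000000. / elapsedMicros : 0;
}

// Average bitrate in kbit/s: bytes * 0.008 gives kilobits, dividing by the
// frame count gives kilobits per frame, and multiplying by the frame rate
// gives kilobits per second.
double progressBitrateKbps(uint64_t totalBytes, int frameNum,
                           uint32_t fpsNum, uint32_t fpsDenom)
{
    if (frameNum <= 0 || fpsDenom == 0)
        return 0;
    const double frameRate = static_cast<double>(fpsNum) / fpsDenom;
    return 0.008 * totalBytes * frameRate / frameNum;
}
```

For example, 125000 bytes over 25 frames at 25 fps is one second of video holding 1000 kilobits, so the second function returns about 1000.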
x265_1.5.tar.gz/source/x265.def.in -> x265_1.6.tar.gz/source/x265.def.in
Changed
@@ -1,6 +1,5 @@
 EXPORTS
 x265_encoder_open_${X265_BUILD}
-x265_setup_primitives
 x265_param_default
 x265_param_default_preset
 x265_param_parse
@@ -20,3 +19,4 @@
 x265_encoder_log
 x265_encoder_close
 x265_cleanup
+x265_api_get_${X265_BUILD}
x265_1.5.tar.gz/source/x265.h -> x265_1.6.tar.gz/source/x265.h
Changed
@@ -91,19 +91,31 @@
 /* Stores all analysis data for a single frame */
 typedef struct x265_analysis_data
 {
+    void* interData;
+    void* intraData;
     uint32_t frameRecordSize;
-    int32_t poc;
-    int32_t sliceType;
+    uint32_t poc;
+    uint32_t sliceType;
     uint32_t numCUsInFrame;
     uint32_t numPartitions;
-    void* interData;
-    void* intraData;
 } x265_analysis_data;
 
 /* Used to pass pictures into the encoder, and to get picture data back out of
  * the encoder. The input and output semantics are different */
 typedef struct x265_picture
 {
+    /* presentation time stamp: user-specified, returned on output */
+    int64_t pts;
+
+    /* display time stamp: ignored on input, copied from reordered pts. Returned
+     * on output */
+    int64_t dts;
+
+    /* force quantizer for != X265_QP_AUTO */
+    /* The value provided on input is returned with the same picture (POC) on
+     * output */
+    void* userData;
+
     /* Must be specified on input pictures, the number of planes is determined
      * by the colorSpace value */
     void* planes[3];
@@ -132,18 +144,8 @@
      * initialize this value to the internal color space */
     int colorSpace;
 
-    /* presentation time stamp: user-specified, returned on output */
-    int64_t pts;
-
-    /* display time stamp: ignored on input, copied from reordered pts. Returned
-     * on output */
-    int64_t dts;
-
-    /* The value provided on input is returned with the same picture (POC) on
-     * output */
-    void* userData;
-
-    /* force quantizer for != X265_QP_AUTO */
+    /* Force the slice base QP for this picture within the encoder. Set to 0
+     * to allow the encoder to determine base QP */
     int forceqp;
 
     /* If param.analysisMode is X265_ANALYSIS_OFF this field is ignored on input
@@ -159,8 +161,6 @@
      * this data structure */
     x265_analysis_data analysisData;
 
-    /* new data members to this structure must be added to the end so that
-     * users of x265_picture_alloc/free() can be assured of future safety */
 } x265_picture;
 
 typedef enum
@@ -229,7 +229,11 @@
 #define X265_B_ADAPT_FAST 1
 #define X265_B_ADAPT_TRELLIS 2
 
+#define X265_REF_LIMIT_DEPTH 1
+#define X265_REF_LIMIT_CU 2
+
 #define X265_BFRAME_MAX 16
+#define X265_MAX_FRAME_THREADS 16
 
 #define X265_TYPE_AUTO 0x0000 /* Let x265 choose the right type */
 #define X265_TYPE_IDR 0x0001
@@ -237,13 +241,14 @@
 #define X265_TYPE_P 0x0003
 #define X265_TYPE_BREF 0x0004 /* Non-disposable B-frame */
 #define X265_TYPE_B 0x0005
+#define IS_X265_TYPE_I(x) ((x) == X265_TYPE_I || (x) == X265_TYPE_IDR)
+#define IS_X265_TYPE_B(x) ((x) == X265_TYPE_B || (x) == X265_TYPE_BREF)
+
 #define X265_QP_AUTO 0
 
 #define X265_AQ_NONE 0
 #define X265_AQ_VARIANCE 1
 #define X265_AQ_AUTO_VARIANCE 2
-#define IS_X265_TYPE_I(x) ((x) == X265_TYPE_I || (x) == X265_TYPE_IDR)
-#define IS_X265_TYPE_B(x) ((x) == X265_TYPE_B || (x) == X265_TYPE_BREF)
 
 /* NOTE! For this release only X265_CSP_I420 and X265_CSP_I444 are supported */
 
@@ -308,11 +313,9 @@
     double elapsedEncodeTime; /* wall time since encoder was opened */
     double elapsedVideoTime; /* encoded picture count / frame rate */
     double bitrate; /* accBits / elapsed video time */
+    uint64_t accBits; /* total bits output thus far */
     uint32_t encodedPictureCount; /* number of output pictures thus far */
     uint32_t totalWPFrames; /* number of uni-directional weighted frames used */
-    uint64_t accBits; /* total bits output thus far */
-
-    /* new statistic member variables must be added below this line */
 } x265_stats;
 
 /* String values accepted by x265_param_parse() (and CLI) for various parameters */
@@ -322,7 +325,8 @@
 static const char * const x265_fullrange_names[] = { "limited", "full", 0 };
 static const char * const x265_colorprim_names[] = { "", "bt709", "undef", "", "bt470m", "bt470bg", "smpte170m", "smpte240m", "film", "bt2020", 0 };
 static const char * const x265_transfer_names[] = { "", "bt709", "undef", "", "bt470m", "bt470bg", "smpte170m", "smpte240m", "linear", "log100",
-                                                    "log316", "iec61966-2-4", "bt1361e", "iec61966-2-1", "bt2020-10", "bt2020-12", 0 };
+                                                    "log316", "iec61966-2-4", "bt1361e", "iec61966-2-1", "bt2020-10", "bt2020-12",
+                                                    "smpte-st-2084", "smpte-st-428", 0 };
 static const char * const x265_colmatrix_names[] = { "GBR", "bt709", "undef", "", "fcc", "bt470bg", "smpte170m", "smpte240m",
                                                      "YCgCo", "bt2020nc", "bt2020c", 0 };
 static const char * const x265_sar_names[] = { "undef", "1:1", "12:11", "10:11", "16:11", "40:33", "24:11", "20:11",
@@ -334,9 +338,9 @@
  * If zones overlap, whichever comes later in the list takes precedence. */
 typedef struct x265_zone
 {
-    int startFrame, endFrame; /* range of frame numbers */
-    int bForceQp; /* whether to use qp vs bitrate factor */
-    int qp;
+    int   startFrame, endFrame; /* range of frame numbers */
+    int   bForceQp;             /* whether to use qp vs bitrate factor */
+    int   qp;
     float bitrateFactor;
 } x265_zone;
 
@@ -348,36 +352,77 @@
  * x265_param as an opaque data structure */
 typedef struct x265_param
 {
-    /*== Encoder Environment ==*/
-
     /* x265_param_default() will auto-detect this cpu capability bitmap. it is
      * recommended to not change this value unless you know the cpu detection is
      * somehow flawed on your target hardware. The asm function tables are
      * process global, the first encoder configures them for all encoders */
     int cpuid;
 
+    /*== Parallelism Features ==*/
+
+    /* Number of concurrently encoded frames between 1 and X265_MAX_FRAME_THREADS
+     * or 0 for auto-detection. By default x265 will use a number of frame
+     * threads empirically determined to be optimal for your CPU core count,
+     * between 2 and 6. Using more than one frame thread causes motion search
+     * in the down direction to be clamped but otherwise encode behavior is
+     * unaffected. With CQP rate control the output bitstream is deterministic
+     * for all values of frameNumThreads greater than 1. All other forms of
+     * rate-control can be negatively impacted by increases to the number of
+     * frame threads because the extra concurrency adds uncertainty to the
+     * bitrate estimations. Frame parallelism is generally limited by the the
+     * is generally limited by the the number of CU rows
+     *
+     * When thread pools are used, each frame thread is assigned to a single
+     * pool and the frame thread itself is given the node affinity of its pool.
+     * But when no thread pools are used no node affinity is assigned. */
+    int frameNumThreads;
+
+    /* Comma seperated list of threads per NUMA node. If "none", then no worker
+     * pools are created and only frame parallelism is possible. If NULL or ""
+     * (default) x265 will use all available threads on each NUMA node.
+     *
+     * '+' is a special value indicating all cores detected on the node
+     * '*' is a special value indicating all cores detected on the node and all
+     * remaining nodes.
+     * '-' is a special value indicating no cores on the node, same as '0'
+     *
+     * example strings for a 4-node system:
+     * "" - default, unspecified, all numa nodes are used for thread pools
+     * "*" - same as default
+     * "none" - no thread pools are created, only frame parallelism possible
+     * "-" - same as "none"
+     * "10" - allocate one pool, using up to 10 cores on node 0
+     * "-,+" - allocate one pool, using all cores on node 1
+     * "+,-,+" - allocate two pools, using all cores on nodes 0 and 2
+     * "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
+     * "-,*" - allocate three pools, using all cores on nodes 1, 2 and 3
+     * "8,8,8,8" - allocate four pools with up to 8 threads in each pool
+     *
+     * The total number of threads will be determined by the number of threads
+     * assigned to all nodes. The worker threads will each be given affinity for
+     * their node, they will not be allowed to migrate between nodes, but they
+     * will be allowed to move between CPU cores within their node.
+     *
+     * If the three pool features: bEnableWavefront, bDistributeModeAnalysis and
+     * bDistributeMotionEstimation are all disabled, then numaPools is ignored
+     * and no thread pools are created.
+     *
+     * If "none" is specified, then all three of the thread pool features are
+     * implicitly disabled.
+     *
+     * Multiple thread pools will be allocated for any NUMA node with more than
+     * 64 logical CPU cores. But any given thread pool will always use at most
+     * one NUMA node.
+     *
+     * Frame encoders are distributed between the available thread pools, and
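The pool string syntax documented in the comment above ('+', '-', '*', a number, "none" or an empty string) can be illustrated with a small parser. This is only a sketch under stated assumptions: `expandPools` is a hypothetical function and not the x265 implementation, it takes the per-node core counts as an input rather than detecting them, and it ignores the rule about splitting nodes with more than 64 logical cores into several pools.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Sketch: expand a --pools style string into a worker count per NUMA node.
// coresPerNode[i] is the number of logical cores detected on node i.
std::vector<int> expandPools(const std::string& pools, const std::vector<int>& coresPerNode)
{
    const size_t nodes = coresPerNode.size();
    std::vector<int> threads(nodes, 0);

    if (pools == "none")
        return threads;      // no thread pools at all
    if (pools.empty())
        return coresPerNode; // default: every core on every node

    size_t start = 0;
    for (size_t node = 0; node < nodes && start <= pools.size(); node++)
    {
        size_t comma = pools.find(',', start);
        if (comma == std::string::npos)
            comma = pools.size();
        const std::string tok = pools.substr(start, comma - start);
        start = comma + 1;

        if (tok == "*")
        {
            // this node and all remaining nodes use every detected core
            for (size_t n = node; n < nodes; n++)
                threads[n] = coresPerNode[n];
            break;
        }
        else if (tok == "+")
            threads[node] = coresPerNode[node];
        else if (tok == "-")
            threads[node] = 0;
        else // a number means "up to N cores on this node"
            threads[node] = std::min(std::max(std::atoi(tok.c_str()), 0), coresPerNode[node]);
    }
    return threads; // nodes not mentioned in the string keep 0 workers
}
```

On a 4-node machine with 16 cores per node, "-,*" yields workers on nodes 1, 2 and 3 only, and "10" yields ten workers on node 0 and none elsewhere, matching the example strings in the comment.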
x265_1.5.tar.gz/source/x265cli.h -> x265_1.6.tar.gz/source/x265cli.h
Changed
@@ -37,7 +37,8 @@
     { "version", no_argument, NULL, 'V' },
     { "asm", required_argument, NULL, 0 },
     { "no-asm", no_argument, NULL, 0 },
-    { "threads", required_argument, NULL, 0 },
+    { "pools", required_argument, NULL, 0 },
+    { "numa-pools", required_argument, NULL, 0 },
     { "preset", required_argument, NULL, 'p' },
     { "tune", required_argument, NULL, 't' },
     { "frame-threads", required_argument, NULL, 'F' },
@@ -71,6 +72,8 @@
     { "no-wpp", no_argument, NULL, 0 },
     { "wpp", no_argument, NULL, 0 },
     { "ctu", required_argument, NULL, 's' },
+    { "min-cu-size", required_argument, NULL, 0 },
+    { "max-tu-size", required_argument, NULL, 0 },
     { "tu-intra-depth", required_argument, NULL, 0 },
     { "tu-inter-depth", required_argument, NULL, 0 },
     { "me", required_argument, NULL, 0 },
@@ -96,6 +99,8 @@
     { "no-cu-lossless", no_argument, NULL, 0 },
     { "no-constrained-intra", no_argument, NULL, 0 },
     { "constrained-intra", no_argument, NULL, 0 },
+    { "cip", no_argument, NULL, 0 },
+    { "no-cip", no_argument, NULL, 0 },
     { "fast-intra", no_argument, NULL, 0 },
     { "no-fast-intra", no_argument, NULL, 0 },
     { "no-open-gop", no_argument, NULL, 0 },
@@ -105,6 +110,7 @@
     { "scenecut", required_argument, NULL, 0 },
     { "no-scenecut", no_argument, NULL, 0 },
     { "rc-lookahead", required_argument, NULL, 0 },
+    { "lookahead-slices", required_argument, NULL, 0 },
     { "bframes", required_argument, NULL, 'b' },
     { "bframe-bias", required_argument, NULL, 0 },
     { "b-adapt", required_argument, NULL, 0 },
@@ -136,6 +142,8 @@
     { "cbqpoffs", required_argument, NULL, 0 },
     { "crqpoffs", required_argument, NULL, 0 },
     { "rd", required_argument, NULL, 0 },
+    { "rdoq-level", required_argument, NULL, 0 },
+    { "no-rdoq-level", no_argument, NULL, 0 },
     { "psy-rd", required_argument, NULL, 0 },
     { "psy-rdoq", required_argument, NULL, 0 },
     { "no-psy-rd", no_argument, NULL, 0 },
@@ -195,6 +203,8 @@
     { "analysis-mode", required_argument, NULL, 0 },
     { "analysis-file", required_argument, NULL, 0 },
     { "strict-cbr", no_argument, NULL, 0 },
+    { "temporal-layers", no_argument, NULL, 0 },
+    { "no-temporal-layers", no_argument, NULL, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
@@ -246,10 +256,11 @@
     H0(" --[no-]psnr Enable reporting PSNR metric scores. Default %s\n", OPT(param->bEnablePsnr));
     H0("\nProfile, Level, Tier:\n");
     H0(" --profile <string> Enforce an encode profile: main, main10, mainstillpicture\n");
-    H0(" --level-idc <integer|float> Force a minumum required decoder level (as '5.0' or '50')\n");
+    H0(" --level-idc <integer|float> Force a minimum required decoder level (as '5.0' or '50')\n");
     H0(" --[no-]high-tier If a decoder level is specified, this modifier selects High tier of that level\n");
     H0("\nThreading, performance:\n");
-    H0(" --threads <integer> Number of threads for thread pool (0: detect CPU core count, default)\n");
+    H0(" --pools <integer,...> Comma separated thread count per thread pool (pool per NUMA node)\n");
+    H0(" '-' implies no threads on node, '+' implies one thread per core on node\n");
     H0("-F/--frame-threads <integer> Number of concurrently encoded frames. 0: auto-determined by core count\n");
     H0(" --[no-]wpp Enable Wavefront Parallel Processing. Default %s\n", OPT(param->bEnableWavefront));
     H0(" --[no-]pmode Parallel mode analysis. Default %s\n", OPT(param->bDistributeModeAnalysis));
@@ -262,14 +273,16 @@
     H0(" psnr, ssim, grain, zerolatency, fastdecode\n");
     H0("\nQuad-Tree size and depth:\n");
     H0("-s/--ctu <64|32|16> Maximum CU size (WxH). Default %d\n", param->maxCUSize);
+    H0(" --min-cu-size <64|32|16|8> Minimum CU size (WxH). Default %d\n", param->minCUSize);
+    H0(" --max-tu-size <32|16|8|4> Maximum TU size (WxH). Default %d\n", param->maxTUSize);
     H0(" --tu-intra-depth <integer> Max TU recursive depth for intra CUs. Default %d\n", param->tuQTMaxIntraDepth);
     H0(" --tu-inter-depth <integer> Max TU recursive depth for inter CUs. Default %d\n", param->tuQTMaxInterDepth);
     H0("\nAnalysis:\n");
-    H0(" --rd <0..6> Level of RD in mode decision 0:least....6:full RDO. Default %d\n", param->rdLevel);
+    H0(" --rd <0..6> Level of RDO in mode decision 0:least....6:full RDO. Default %d\n", param->rdLevel);
     H0(" --[no-]psy-rd <0..2.0> Strength of psycho-visual rate distortion optimization, 0 to disable. Default %.1f\n", param->psyRd);
-    H0(" --[no-]psy-rdoq <0..50.0> Strength of psycho-visual optimization in quantization, 0 to disable. Default %.1f\n", param->psyRdoq);
+    H0(" --[no-]rdoq-level <0|1|2> Level of RDO in quantization 0:none, 1:levels, 2:levels & coding groups. Default %d\n", param->rdoqLevel);
+    H0(" --[no-]psy-rdoq <0..50.0> Strength of psycho-visual optimization in RDO quantization, 0 to disable. Default %.1f\n", param->psyRdoq);
     H0(" --[no-]early-skip Enable early SKIP detection. Default %s\n", OPT(param->bEnableEarlySkip));
-    H1(" --[no-]fast-cbf Enable early outs based on whether residual is coded. Default %s\n", OPT(param->bEnableCbfFastMode));
     H1(" --[no-]tskip-fast Enable fast intra transform skipping. Default %s\n", OPT(param->bEnableTSkipFast));
     H1(" --nr-intra <integer> An integer value in range of 0 to 2000, which denotes strength of noise reduction in intra CUs. Default 0\n");
     H1(" --nr-inter <integer> An integer value in range of 0 to 2000, which denotes strength of noise reduction in inter CUs. Default 0\n");
@@ -300,6 +313,7 @@
     H0(" --no-scenecut Disable adaptive I-frame decision\n");
     H0(" --scenecut <integer> How aggressively to insert extra I-frames. Default %d\n", param->scenecutThreshold);
     H0(" --rc-lookahead <integer> Number of frames for frame-type lookahead (determines encoder latency) Default %d\n", param->lookaheadDepth);
+    H1(" --lookahead-slices <0..16> Number of slices to use per lookahead cost estimate. Default %d\n", param->lookaheadSlices);
     H0(" --bframes <integer> Maximum number of consecutive b-frames (now it only enables B GOP structure) Default %d\n", param->bframes);
     H1(" --bframe-bias <integer> Bias towards B frame decisions. Default %d\n", param->bFrameBias);
     H0(" --b-adapt <0..2> 0 - none, 1 - fast, 2 - full (trellis) adaptive B frame scheduling. Default %d\n", param->bFrameAdaptive);
@@ -371,10 +385,11 @@
     H1(" smpte240m, GBR, YCgCo, bt2020nc, bt2020c. Default undef\n");
     H1(" --chromaloc <integer> Specify chroma sample location (0 to 5). Default of %d\n", param->vui.chromaSampleLocTypeTopField);
     H0("\nBitstream options:\n");
+    H0(" --[no-]repeat-headers Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
     H0(" --[no-]info Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI));
-    H0(" --[no-]aud Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters));
     H0(" --[no-]hrd Enable HRD parameters signaling. Default %s\n", OPT(param->bEmitHRDSEI));
-    H0(" --[no-]repeat-headers Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
+    H0(" --[no-]temporal-layers Enable a temporal sublayer for unreferenced B frames. Default %s\n", OPT(param->bEnableTemporalSubLayers));
+    H0(" --[no-]aud Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters));
     H1(" --hash <integer> Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum. Default %d\n", param->decodedPictureHashSEI);
     H1("\nReconstructed video options (debugging):\n");
     H1("-r/--recon <filename> Reconstructed raw image YUV or Y4M output file name\n");