audio data bandwidth

notes

audio scale compression

soft-clipping 20 bit to 16 bit, or 24 to 20, or 12 to 8

Music-power ratings on audio equipment are always much lower than RMS (root-mean-square) power ratings on the same, because music, and especially vocal, exhibits extreme instantaneous power excursions: as much as 20 decibels, a factor of 10 in amplitude (100 in power), due to compounded sinusoids. Clipping these instantaneous excursions, which may last but one sample in 20, produces distinct 'clipping' noise in the output: it is the equivalent of occasionally adding in strong negative spikes. Audio CD mastering today uses 20-bit data to avoid this clipping - some uses 24, though the disc itself stores 16-bit samples - but rarely achieves its maximum goal: the top 4-5 bits of digital amplitude expression must be reserved for such peaks, the next 5-6 bits are 15 dB of amplitude ranging, and the last 10 are purity, totalling 60 dB SNR. While 8-bit produces discernible music and voice, and is used on the internet for news and music-sampling, it is underutilized and heavily clipped, as shows notably on a scope. 20-bit floating-point is digitally more efficient than 20-bit fixed-point (which has the advantage of extreme purity when unclipped), but play-back equipment is range-limited. The lower audio spectrum is also advantaged by cumulative sample averaging in 20-bit fixed - though it, too, can be improved by selectively rounding-to-digital.
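The peak-over-RMS excursion of compounded sinusoids can be checked numerically. What follows is only a minimal sketch, assuming equal-amplitude cosine harmonics that all peak on the same sample; the harmonic counts and the function name are illustrative and not taken from any recording.

```python
import math

def crest_factor_db(num_harmonics, samples=4096):
    """Peak-to-RMS ratio, in dB, of equal-amplitude phase-aligned cosine harmonics."""
    wave = [
        sum(math.cos(2 * math.pi * k * n / samples)
            for k in range(1, num_harmonics + 1))
        for n in range(samples)
    ]
    peak = max(abs(v) for v in wave)
    rms = math.sqrt(sum(v * v for v in wave) / samples)
    return 20 * math.log10(peak / rms)

single = crest_factor_db(1)    # one sinusoid: about 3 dB
stacked = crest_factor_db(10)  # ten aligned harmonics: about 13 dB
print(round(single, 1), round(stacked, 1))
```

Under these assumptions the ratio grows as 10*log10(2N) dB for N aligned components, so a 20 dB excursion corresponds to roughly fifty coincident peaks.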

A way around this consequential clipping requires pre-filtering with non-monotonic frequency-phase delays, which re-phase audio frequencies variously, such that harmonics become phase-distributed and not as multiply peak-coincident: the amplitude peaks of the component signal frequencies are pseudo-randomly repositioned, dropping peaks into near-phase valleys. There are many harmonics in typical voice - 6 or more for a pitch at middle 'C' - and the spread can be quite entropic, and effectual; a more computed efficiency could be gained by selectively adjusting phase-delay for optimal peak-flattening. Another remedy is clip-smoothing, of both high bits and low: when the signal misses its target value, the missing portion should be added into the adjacent samples. (This is remedial even as an erasure sample is usually taken as the adjacent value on playback.) [Similarly, in coding, a quantum transition should indicate an unchanged value that averages between the two - this is rare except at the lowest recorded amplitudes.]
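The effect of re-phasing on peak coincidence can be illustrated with a toy signal. This is a rough sketch, assuming sixteen unit-amplitude harmonics and uniformly random phases from a fixed seed; a real pre-filter would apply frequency-dependent delays to recorded audio, which is not modelled here.

```python
import math
import random

def peak_of_harmonics(phases, samples=2048):
    """Largest absolute sample of unit-amplitude harmonics with the given phases."""
    return max(
        abs(sum(math.cos(2 * math.pi * (k + 1) * n / samples + ph)
                for k, ph in enumerate(phases)))
        for n in range(samples)
    )

num_harmonics = 16
aligned = [0.0] * num_harmonics
rng = random.Random(1999)  # fixed seed (arbitrary) so the run is repeatable
scattered = [rng.uniform(0, 2 * math.pi) for _ in range(num_harmonics)]

aligned_peak = peak_of_harmonics(aligned)      # every harmonic peaks together
scattered_peak = peak_of_harmonics(scattered)  # peaks fall into each other's valleys
print(aligned_peak, round(scattered_peak, 2))
```

Both signals carry the same power, since only the phases differ; the scattered version simply needs less headroom.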

(Note that adding signal clippage to adjacent samples approximates a localized low-pass filter, keeping the lower frequencies cleaner and momentarily reducing the higher in amplitude.)

Reduction:

Digital signal compression would recover most of the 4-bit loss: variable-to-fixed-length data reduction could compress most of 20-bit into 16-bit, but with no guarantee; the digitization is not compatible with extant players, and it increases computation loading.

Alternatively we can code the audio directly, without digital reduction encoding, by special-case compatible coding of the clipped values. Soft-clipping replaces only clipped peak values (which play back clipped in extant players anyway) with unclipped but truncated values: each clipped value is re-represented as a quasi-exponented value. We preclude the top 1/16th or less of the available digital amplitudes, to re-code these into quickly scaled-up values (thereby soft-clipped), coarsely truncated in the lowest digits, thereby reducing effective clip-amplitude by several digital orders of magnitude - compatibly.

Whenever the (music) input-signal to be digitized exceeds the clipping threshold, the excess is coded truncated, to be scaled up on play-back: implementations include 24-bit A/D recoded thiswise onto 20-bit soft-clip, or paired A/D channels of distinct scales with their common decision switch-over and digital representation merge above its threshold. Compatibly, such soft-clipped music played back on a regular CD player without soft-clip has slightly rounded leading and trailing clip-edges (of no consequence). Play-back in a soft-clip CD player is implemented by a decision at the digital threshold to scale up the digital excess remainder and add it to the (precision reference) threshold, to maintain compatibility.

An over-20-bit soft-clipped amplitude is represented digitally as a (scaled-up) 24-bit amplitude missing its lowest 8 bits: the highest 4 bits of the 20-bit code mark the threshold, and the lower 16 bits are the excess, to be scaled up by 8 bit-places and then added onto it. A soft-clipped peak is thus quantized to 16 bits, in steps 48 dB coarser than the unclipped samples, but it reaches +24 dB (up) compared to the otherwise unclipped amplitudes. [Softer exponential scaling schemes may suffice for some applications.]
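The 24-to-20-bit recoding can be sketched in a few lines. This is one reading of the scheme, not a definitive implementation: it assumes non-negative magnitudes with the sign carried separately, and the constant and function names are illustrative.

```python
THRESHOLD = 0xF0000   # highest 4 bits of the 20-bit code set: start of the top 1/16th
EXCESS_MASK = 0xFFFF  # lower 16 bits carry the truncated excess
SHIFT = 8             # the excess is scaled by 8 bit-places

def soft_clip_encode(magnitude):
    """Map a non-negative amplitude (up to about 24 bits) onto a 20-bit code."""
    if magnitude < THRESHOLD:
        return magnitude  # below threshold: stored unchanged
    excess = min((magnitude - THRESHOLD) >> SHIFT, EXCESS_MASK)
    return THRESHOLD | excess

def soft_clip_decode(code):
    """Soft-clip player: scale the excess back up and add it onto the threshold."""
    if code < THRESHOLD:
        return code
    return THRESHOLD + ((code & EXCESS_MASK) << SHIFT)

code = soft_clip_encode(5_000_000)
print(hex(code), soft_clip_decode(code))  # 0xf3d4b 4999936
```

A player without soft-clip would reproduce the code as-is, somewhere in the top 1/16th of its range, which is the compatible (clipped-looking) behaviour described above.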

Similarly 8-bit audio can be soft-clipped, but for less improvement. A mathematically interesting illustration of 8-bit soft-clipping: add up the consecutive highest bits equaling 1, scaling up (shifting) until the highest is 0, and adding (appending) the remainder; this is a discrete linear-segment approximation to logarithmic compression. The maximum peak value is 4x higher, and progressively coarser.

But another quasi-logarithmic compression scheme assigns the high bits to a radix-two scale (exponent or "characteristic") and the low bits to a precision-fragment (mantissa or "fraction"); the highest bit indicates negative. Zero-scale values assume precision as integer, switching to floating point with a presumed leading "1" at any non-zero scale (an inclusive-or of the scale bits). 8-bit, partitioned 1+3+4, maintains accuracy within 1.6%-3% (nearest 1/32nd-1/63rd); its maximum value touches 12-bit cumulative range, ±1984; and the 66 smallest values are integers of 6 bits or fewer, ±{0,...,32}. It is coarsely similar to logarithmic ranging, (1.06)^{-12,...,115: 7-bit} = ±{0.5,...,800}, at a constant average 3%. 10-bit, 1+4+5, touches 21-bit at 0.8%-1.6%: 2^15 scale, 5-bit fragment with presumed leading "1"; etc.
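The 1+3+4 partition can be made concrete with a small decoder and a brute-force encoder. This is a sketch of one bit layout consistent with the figures quoted (maximum ±1984, 66 integer-exact codes); the function names and the nearest-value search are illustrative choices.

```python
def decode_134(code):
    """Decode an 8-bit sign(1) + scale(3) + fragment(4) code to an integer."""
    sign = -1 if code & 0x80 else 1
    scale = (code >> 4) & 0x7
    fragment = code & 0xF
    if scale == 0:
        return sign * fragment                      # zero scale: plain integer 0..15
    return sign * ((16 + fragment) << (scale - 1))  # presumed leading "1"

def encode_134(value):
    """Pick the code whose magnitude is nearest to |value| (clamped to 1984)."""
    magnitude = min(abs(value), 1984)
    best = min(range(128), key=lambda c: abs(decode_134(c) - magnitude))
    return best | (0x80 if value < 0 else 0)

largest = max(decode_134(c) for c in range(256))
small_codes = sum(1 for c in range(256) if abs(decode_134(c)) <= 32)
print(largest, small_codes, decode_134(encode_134(-1000)))  # 1984 66 -992
```

The count of 66 includes both signed zeros, and rounding to the nearest code keeps the relative error of any magnitude from 16 to 1984 within 1/32.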

(Quasi-logarithmic compressions belong to a mathematical class of human numbers.)

Where formant-specific pink-noise is permissible, as in communications by voice where pitch clarity does not change the meaning of words - (there being few linguistic [alphabetizable] sounds partitioning the formant-space, sonic-relative amplitude domain) - another compression scheme may be presently usable: by selectively adding back a portion of the original signal around each sample, beginning with those most coarsely represented, the sampling inaccuracy can be reduced, improving 8-bit as though it were 9-bit with the lowest bit empirically always sampled as zero. The formants change unnoticeably to slightly off-color [technical meaning], and slightly raspier, as the original is used only slightly enveloped. However, while the higher formants need only a narrower normal-envelope, the lower-pitch formants need a wider normal-envelope, and it may suffice to split the formants first, adjust each separately, then add them together; the lowest bit remains zeroed - for efficient transmission (no extra bit is sent).

[A similar scheme spreads quantization energy into adjacent samples, requiring simple re-quantizing, so as to retain the same total sound energy near-sample - less noticeably at low frequencies than highs]

[Yet other schemes adjust sample amplitudes slightly by a phoneme-speed (very slow) modulator, so as to optimally re-scale sampled amplitudes to be more typically nearer quantization rungs]

Yet another adjusts sampling speed (by computing or also sampling the derivative) and allows output quantization (optimized) to any value that also maintains the apparent frequency phase to within 3% (that being less heard) - however, simply combining the exponential 3%-form for differential samples accomplishes the same.

Yet another improvement is possible by pre-computing the wide-band filter quadrature (cosine for sine) of the to-be-recorded signal, and sampling between in-phase and quadrature where the maximum signal excursion is minimized, shifting only very slowly so as to prevent frequency-chirping: this optimal position changes (slowly) with room acoustics and speaker phoneme formants.
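Choosing a position between in-phase and quadrature amounts to rotating every component by one common phase angle. The sketch below searches that angle for a synthetic signal of eight unit harmonics, an assumption made so the quadrature is known in closed form; a real recorder would need a wide-band quadrature filter and a slowly varying angle, neither of which is modelled.

```python
import math

def rotated_peak(amplitudes, theta, samples=1024):
    """Peak of sum a_k*cos(k*w*t + theta): in-phase at theta=0, quadrature at pi/2."""
    return max(
        abs(sum(a * math.cos(2 * math.pi * (k + 1) * n / samples + theta)
                for k, a in enumerate(amplitudes)))
        for n in range(samples)
    )

amplitudes = [1.0] * 8
angles = [math.pi * step / 36 for step in range(36)]  # 0 to 175 degrees, 5-degree steps

in_phase_peak = rotated_peak(amplitudes, 0.0)
best_angle = min(angles, key=lambda th: rotated_peak(amplitudes, th))
best_peak = rotated_peak(amplitudes, best_angle)
print(in_phase_peak, round(math.degrees(best_angle)), round(best_peak, 2))
```

The rotation leaves each component's amplitude untouched, so the spectrum magnitude is unchanged while the maximum excursion drops.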

logarithmic frequency analysis

Next to consider are the phoneme rate, 27 phonemes/second maximum ('A-27' is the lowest note on the piano), during which the formant frequency components are cumulated into one discernible linguistic (phoneme) sound, and the basic logarithmic frequency harmonic recognition (the notes of the piano are logarithmically spaced, 12 note-steps in every pitch doubling) - which in combination mean that we can pre-spread phonemes by simple add/subtract vectors hundreds of samples long (per half-phoneme) and then pack the frequency analysis processing into minimized sample intervals, shorter for higher frequencies. [SEE ALSO aurum logarithmic audio-spectrum analysis]

Speech recognition can then select, in each half-phoneme interval, among 40-60 alphabetizable sounds (arbitrarily, preliminarily choosing English with further dialectal variants) - for a text-type linguistic data-rate of 330 bps (bits per second). [Text data maximum rate is half that, 165 bps, on full phonemes - about 150 baud standard teletype speed] Further reductions might be processed on the word structure, sound-to-sound chaining: the average word length is near 7 phonemes, 8 including the null-spacing, but except for near 20 vowel-consonant transitions, there are near 4 consonant-consonant transitions possible, except when distinguishing adjacent syllables. Further speech data reduction then processes on grammatical structure data.
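The data-rate figures can be reproduced approximately with fixed-length codes. This sketch assumes each selection is sent as a whole number of bits, which is one plausible reading of the arithmetic rather than a stated part of the scheme; the names are illustrative.

```python
import math

PHONEMES_PER_SECOND = 27   # maximum phoneme rate quoted in the text
ALPHABET_SIZES = (40, 60)  # range of alphabetizable sounds quoted in the text

def fixed_length_bps(selections_per_second, alphabet_size):
    """Bit rate when every selection uses ceil(log2(alphabet_size)) bits."""
    bits = math.ceil(math.log2(alphabet_size))
    return selections_per_second * bits

half_phoneme_rates = [fixed_length_bps(2 * PHONEMES_PER_SECOND, n) for n in ALPHABET_SIZES]
full_phoneme_rates = [fixed_length_bps(PHONEMES_PER_SECOND, n) for n in ALPHABET_SIZES]
print(half_phoneme_rates, full_phoneme_rates)  # [324, 324] [162, 162]
```

Either alphabet size needs 6 bits per selection, giving 324 bps on half-phonemes and 162 bps on full phonemes, close to the rounded 330 and 165.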

Grand-Admiral Petry
'Majestic Service in a Solar System'
Nuclear Emergency Management

© 1999 GrandAdmiralPetry@Lanthus.net