Modulo:Utf8debug
Documentation
Purpose
Module for examining strings and making invisible characters visible. To be invoked only from sandboxes and talk pages, not from articles.
Parameters (1...3)
- (anonymous and obligatory) -- the string to examine
- (named and optional) outctl= -- type of result, 4 digits or the special value "nw" equal to "0010", default "1101", maximum "1131", value "0000" prohibited
  - "0" or "1" -- show the size in octets
  - "0" or "1" -- show big boxes with codes for individual UTF8 characters
  - "0" to "3" -- show the text using the "hard nowiki" method, without boxes
  - "0" or "1" -- show the size in UTF8 characters
- (named and optional) empsil=1 -- on empty input, do not show a red box but return an empty string
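The digit layout of "outctl" can be illustrated with a short Lua sketch. It is not taken from the module: the function name "decodeoutctl" and the field names are hypothetical, and the checks only mirror the limits stated in the list above (default "1101", synonym "nw", maximum "1131", "0000" prohibited).

```lua
-- Hypothetical helper, not part of the module: split an "outctl" value
-- into the four switches described in the parameter list.
local function decodeoutctl (stroutctl)
  if (stroutctl == nil) then
    stroutctl = "1101" -- documented default
  end
  if (stroutctl == "nw") then
    stroutctl = "0010" -- documented synonym
  end
  if (not string.match (stroutctl, "^[01][01][0-3][01]$")) then
    return nil -- wrong length, or a digit above the documented maximum "1131"
  end
  if (stroutctl == "0000") then
    return nil -- documented as prohibited
  end
  return {
    octets   = (string.sub (stroutctl,1,1) == "1"),  -- size in octets
    bigboxes = (string.sub (stroutctl,2,2) == "1"),  -- big boxes per char
    hardnw   = tonumber (string.sub (stroutctl,3,3)), -- 0...3 "hard nowiki"
    chars    = (string.sub (stroutctl,4,4) == "1"),  -- size in UTF8 chars
  }
end

local tabnw = decodeoutctl ("nw")
print (tabnw.octets, tabnw.bigboxes, tabnw.hardnw, tabnw.chars)
```

With this sketch, "nw" switches off everything except the plain "hard nowiki" text, while a value such as "1141" is rejected because its third digit exceeds "3".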
Example
{{#invoke:Utf8debug|ek|AАΑBВΒCСEЕΕHНΗIΙKКΚMМΜNΝOОΟPРΡTТΤXХΧYУΥZΖaаcсeеoоοpрxхyу}}
- look-alike but distinct letters
number of octet:s : 90 |
index 0 beg code $41=#65 length 1 A |
index 1 beg code $D0=#208 length 2 extra $90 codepoint U+$0410 dec #1'040 А |
index 3 beg code $CE=#206 length 2 extra $91 codepoint U+$0391 dec #913 Α |
index 5 beg code $42=#66 length 1 B |
index 6 beg code $D0=#208 length 2 extra $92 codepoint U+$0412 dec #1'042 В |
index 8 beg code $CE=#206 length 2 extra $92 codepoint U+$0392 dec #914 Β |
index 10 beg code $43=#67 length 1 C |
index 11 beg code $D0=#208 length 2 extra $A1 codepoint U+$0421 dec #1'057 С |
index 13 beg code $45=#69 length 1 E |
index 14 beg code $D0=#208 length 2 extra $95 codepoint U+$0415 dec #1'045 Е |
index 16 beg code $CE=#206 length 2 extra $95 codepoint U+$0395 dec #917 Ε |
index 18 beg code $48=#72 length 1 H |
index 19 beg code $D0=#208 length 2 extra $9D codepoint U+$041D dec #1'053 Н |
index 21 beg code $CE=#206 length 2 extra $97 codepoint U+$0397 dec #919 Η |
index 23 beg code $49=#73 length 1 I |
index 24 beg code $CE=#206 length 2 extra $99 codepoint U+$0399 dec #921 Ι |
index 26 beg code $4B=#75 length 1 K |
index 27 beg code $D0=#208 length 2 extra $9A codepoint U+$041A dec #1'050 К |
index 29 beg code $CE=#206 length 2 extra $9A codepoint U+$039A dec #922 Κ |
index 31 beg code $4D=#77 length 1 M |
index 32 beg code $D0=#208 length 2 extra $9C codepoint U+$041C dec #1'052 М |
index 34 beg code $CE=#206 length 2 extra $9C codepoint U+$039C dec #924 Μ |
index 36 beg code $4E=#78 length 1 N |
index 37 beg code $CE=#206 length 2 extra $9D codepoint U+$039D dec #925 Ν |
index 39 beg code $4F=#79 length 1 O |
index 40 beg code $D0=#208 length 2 extra $9E codepoint U+$041E dec #1'054 О |
index 42 beg code $CE=#206 length 2 extra $9F codepoint U+$039F dec #927 Ο |
index 44 beg code $50=#80 length 1 P |
index 45 beg code $D0=#208 length 2 extra $A0 codepoint U+$0420 dec #1'056 Р |
index 47 beg code $CE=#206 length 2 extra $A1 codepoint U+$03A1 dec #929 Ρ |
index 49 beg code $54=#84 length 1 T |
index 50 beg code $D0=#208 length 2 extra $A2 codepoint U+$0422 dec #1'058 Т |
index 52 beg code $CE=#206 length 2 extra $A4 codepoint U+$03A4 dec #932 Τ |
index 54 beg code $58=#88 length 1 X |
index 55 beg code $D0=#208 length 2 extra $A5 codepoint U+$0425 dec #1'061 Х |
index 57 beg code $CE=#206 length 2 extra $A7 codepoint U+$03A7 dec #935 Χ |
index 59 beg code $59=#89 length 1 Y |
index 60 beg code $D0=#208 length 2 extra $A3 codepoint U+$0423 dec #1'059 У |
index 62 beg code $CE=#206 length 2 extra $A5 codepoint U+$03A5 dec #933 Υ |
index 64 beg code $5A=#90 length 1 Z |
index 65 beg code $CE=#206 length 2 extra $96 codepoint U+$0396 dec #918 Ζ |
index 67 beg code $61=#97 length 1 a |
index 68 beg code $D0=#208 length 2 extra $B0 codepoint U+$0430 dec #1'072 а |
index 70 beg code $63=#99 length 1 c |
index 71 beg code $D1=#209 length 2 extra $81 codepoint U+$0441 dec #1'089 с |
index 73 beg code $65=#101 length 1 e |
index 74 beg code $D0=#208 length 2 extra $B5 codepoint U+$0435 dec #1'077 е |
index 76 beg code $6F=#111 length 1 o |
index 77 beg code $D0=#208 length 2 extra $BE codepoint U+$043E dec #1'086 о |
index 79 beg code $CE=#206 length 2 extra $BF codepoint U+$03BF dec #959 ο |
index 81 beg code $70=#112 length 1 p |
index 82 beg code $D1=#209 length 2 extra $80 codepoint U+$0440 dec #1'088 р |
index 84 beg code $78=#120 length 1 x |
index 85 beg code $D1=#209 length 2 extra $85 codepoint U+$0445 dec #1'093 х |
index 87 beg code $79=#121 length 1 y |
index 88 beg code $D1=#209 length 2 extra $83 codepoint U+$0443 dec #1'091 у |
number of UTF8 char:s : 56 |
--[===[
MODULE "UTF8DEBUG" (debug UTF8 text)
"eo.wiktionary.org/wiki/Modulo:utf8debug" <!--2023-Nov-21-->
"id.wiktionary.org/wiki/Modul:utf8debug"
"sv.wiktionary.org/wiki/Modul:utf8debug"
Purpose: allows debugging an incoming UTF8 string (directly submitted or
generated by a template) by splitting it into isolated chars,
checking validity of the UTF8 stream and displaying chars and codes,
or by performing a "hard nowiki" and displaying complete text
including spaces and line breaks
Utilo: ebligas sencimigi enirantan UTF8-signocxenon (rekte enigitan aux
generitan far sxablono) per dispecigo farigxante apartaj signoj,
kontrolante validecon de la UTF8-vico kaj montrante signojn kaj kodojn,
aux per efektivigo de "hard nowiki" kaj montrado de kompleta teksto
inkluzive spacojn kaj liniorompojn
Manfaat: memungkinkan ...
Syfte: moejliggoer att debugga en inkommande UTF8 straeng (direkt oeverlaemnad
eller ...
Used by templates / Uzata far sxablonoj:
* only "debu" (not to be called from any other place, to be
used only for debugging, see below)
Required submodules / Bezonataj submoduloj:
* none / neniuj
Required images:
* "File:Return arrow.svg", Public Domain
This module can accept parameters whether sent to itself (own frame) or
to the caller (caller's frame). If there is a parameter "caller=true"
on the own frame then that own frame is discarded in favor of the
caller's one.
Incoming: * one anonymous and obligatory parameter
* input string (empty is legal but not very
useful, missing ie "nil" same as empty, 64 KiO max)
* two named and optional parameters
* "outctl=" output type selection control string (4 digits,
boolean or fourstate)
* show octet bloat ("0" or "1")
* show big boxes for single char:s ("0" or "1")
* show hard nowiki ("0" or "1" (no colour) or "2" (coloured)
or "3" (coloured and split UTF8))
* show UTF8 char bloat ("0" or "1")
default is "1101", "0000" is prohibited, "nw" is synonymous
with "0010", empty main input switches the type to "1000"
* "empsil=1" to return an empty string on empty input instead of
the default big red box
Returned: * large text with complicated wikicode, empty possible
This module is unbreakable (when called with correct module name
and function name).
Cxi tiu modulo estas nerompebla (kiam vokita kun gxustaj nomo de modulo
kaj nomo de funkcio).
This module is special in that it can seem unused and useless. Do not
delete it just because no pages transclude it. Its purpose is not to be used
in article, lemma, appendix or whatever pages. It is intended to be used
temporarily when debugging UTF8 text, preferably from the sandbox. With the
option "hard nowiki" it can even be used for documentation and selftest of
modules and templates. Then the proxy template "debu" can be classed as
a documentation template. Still the template "pate" is a better choice
for this purpose.
Note that "<nowiki>" does NOT work in wikitext generated by a module. We
must DEC-encode instead. This works for the common problem char:s ":#*='[]"
(there is no problem with curly "{}"). But DEC-encoding does NOT work for UTF8
multi-octet char:s. So we DEC-encode only some ANSI/ASCII char:s $00...$7F
and let the remaining ones pass unchanged (both for "big boxes" and "hard
nowiki"). Note that DEC-encoding does NOT work for LF either. In the "big
boxes" mode we catch LF separately, and in "hard nowiki" mode we show an
arrow as image.
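A minimal sketch of the DEC-encoding idea described above. It is only an illustration, not the module's code: the name "sketchdecencode" is hypothetical, and it assumes that exactly the printable ASCII char:s $21...$7E other than 0...9, A...Z, a...z get encoded while LF, SPACE and all octets above $7F pass unchanged. The real routine "lfiultencode" further below does more (NBSP for spaces, hex for control codes, forced line breaks).

```lua
-- Hypothetical illustration: DEC-encode risky ASCII char:s only.
local function sketchdecencode (strin)
  local strout = string.gsub (strin, ".", function (strchar)
    local numcode = string.byte (strchar)
    local boosafe = ((numcode>=48) and (numcode<=57)) or   -- 0...9
                    ((numcode>=65) and (numcode<=90)) or   -- A...Z
                    ((numcode>=97) and (numcode<=122))     -- a...z
    if ((numcode>32) and (numcode<127) and (not boosafe)) then
      return "&#" .. tostring (numcode) .. ";" -- for example "[" -> "&#91;"
    end
    return strchar -- safe ASCII, SPACE, control codes, UTF8 octets
  end)
  return strout
end

print (sketchdecencode ("[[a:b]]"))
```

For the input "[[a:b]]" the sketch emits the brackets as code 91 and 93 and the colon as code 58, so the wiki parser no longer sees a link.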
In text coming from a module some evil stuff (invalid UTF8 sequence, ZERO,
FF/12, ZWSP, LRM, RLM) is replaced with U+$FFFD by MediaWiki, whereas
other dubious content (TAB, CR, NBSP, BOM) survives.
Color coding of the result in the "big boxes" mode:
1 white ordinary ANSI/ASCII char
2 light grey valid 2-octet UTF8 with some exceptions
3 grey valid 3-octet UTF8 with some exceptions
4 dark grey valid 4-octet UTF8 (with no exceptions yet)
5 red code ZERO or invalid UTF8 sequence or empty main input
6 yellow dubious TAB CR NBSP ZWSP LRM RLM BOM
7 light yellow invisible LF SPACE
8 light blue initial (except empty main input) and final UTF8 bloat report
Error <<FATAL in "utf8debug" : internal error or invalid
parameter'>> is NOT included in the above list, possible causes:
* internal error
* input string too long
* extraneous anonymous parameter
* "outctl=" or "empsil=" bad
Some interesting UTF8 codepoints:
--------- ---------- ----------------------- ------- ----------------------
codepoint codepoint  UTF-8                   short   official name and
HEX       DEC        encoding                name    silly notes
--------- ---------- ----------------------- ------- ----------------------
$0000     #00'000                            ZERO
$0009     #00'009                            TAB
$000A     #00'010                            LF
$000D     #00'013                            CR
$0020     #00'032                            SPACE
$007F     #00'127                                    inclusive end of 1-oct
$0080     #00'128    $C2,$80                         begin of 2-oct
$00A0     #00'160    $C2,$A0                 NBSP    don't break me
$00BF     #00'191    $C2,$BF                         inclusive end of $C2,xx
$00C0     #00'192    $C3,$80                         begin of $C3,xx
$00FF     #00'255    $C3,$BF                         inclusive end of $C3,xx
$0100     #00'256    $C4,$80                         begin of $C4,xx
$0200     #00'512    $C8,$80                         uppercase "A" with something above
$0300     #00'768    $CC,$80                         strange horizontally misplaced apo
$034F     #00'847    $CD,$8F                         COMBINING GRAPHEME JOINER
$0401     #01'025    $D0,$81                         CCCP letter with case delta $50
$0451     #01'105    $D1,$91                         CCCP letter with case delta $50
$07FF     #02'047    $DF,$BF                         inclusive end of 2-oct
$0800     #02'048    $E0,$A0,$80                     begin of 3-oct
$200B     #08'203    $E2,$80,$8B             ZWSP    ZERO WIDTH SPACE
$200C     #08'204    $E2,$80,$8C             ZWNJ    ZERO WIDTH NON-JOINER
$200D     #08'205    $E2,$80,$8D             ZWJ     ZERO WIDTH JOINER
$200E     #08'206    $E2,$80,$8E             LRM     LEFT-TO-RIGHT MARK
$200F     #08'207    $E2,$80,$8F             RLM     RIGHT-TO-LEFT MARK
$2060     #08'288    $E2,$81,$A0                     absurd "WORD JOINER"
$2068     #08'296    $E2,$81,$A8             FSI     FIRST STRONG ISOLATE
$20AC     #08'364    $E2,$82,$AC                     EURO (bank robbery sign)
$D7FF     #55'295    $ED,$9F,$BF                     last before banned range
$D800     #55'296    ($ED,$A0,$80)                   begin of banned range
$DFFF     #57'343    ($ED,$BF,$BF)                   inclusive end of banned range
$E000     #57'344    $EE,$80,$80                     begin of legal range again
$FEFF     #65'279    $EF,$BB,$BF 239,187,191 BOM     absurd "BOM" sigi
$FFFD     #65'533    $EF,$BF,$BD 239,191,189         REPLACEMENT CHARACTER
$FFFE     #65'534    $EF,$BF,$BE 239,191,190         invalid (last 2)
$FFFF     #65'535    $EF,$BF,$BF 239,191,191         invalid (last 2), inclusive end of 3-oct
$01'0000  #65'536    $F0,$90,$80,$80                 begin of 4-oct
$01'0348  #66'376    $F0,$90,$8D,$88                 one of few somewhat known
$0F'FFFF  #1'048'575 $F3,$BF,$BF,$BF                 one Mi almost reached
$10'0000  #1'048'576 $F4,$80,$80,$80                 one Mi reached here and no end yet
$10'FFFE  #1'114'110 $F4,$8F,$BF,$BE                 invalid (last 2)
$10'FFFF  #1'114'111 $F4,$8F,$BF,$BF                 invalid (last 2), inclusive end of unicode
$11'0000  #1'114'112 ($F4,$90,$80,$80)               invalid (finally out of range)
--------- ---------- ----------------------- ------- ----------------------
* UTF8 is defined by "RFC 3629" from 2003-Nov (although UTF8 already
existed before that RFC)
* UTF8 sigi AKA BOM : HEX: $EF $BB $BF | DEC: 239 187 191 | ABS: $FEFF
* absolute unicode range has 17 (seventeen !!!) planes of 65'536 values each
* totally 1'114'112 codepoints, most of them are unused, plane ZERO is
somewhat full, other ones are almost or totally empty
* official notation: "U+0000" ... "U+10FFFF"
* codepoint range ZERO to 31 is valid by RFC but mostly useless, same for
127 and range 128 to 159, whereas 160 AKA "&nbsp;" does appear in wikitext
* range "U+D800" to "U+DFFF" is invalid by RFC
* UTF8 starting octet can be only $C2 to $DF , $E0 to $EF , $F0 to $F4
giving a continuous range from $C2 to $F4 of size $33 = #51 values
* UTF8 subsequent octet:s (1 or 2 or 3) can be only $80 to $BF
(6 bit:s, 64 possible values)
* octet values $C0, $C1 and $F5 to $FF may never appear in a UTF8 stream
Abs. char number range | UTF8 octet sequence            | beginning octet
(hexadecimal)          | (binary)                       |
-----------------------+--------------------------------+------------------
0000'0000 to 0000'007F | 0xxxxxxx                       | $00 to $7F
0000'0080 to 0000'07FF | 110xxxxx 10xxxxxx              | $C0 -> $C2 to $DF
0000'0800 to 0000'FFFF | 1110xxxx 10xxxxxx 10xxxxxx     | $E0 to $EF
0001'0000 to 0010'FFFF | 11110xxx 10xxxxxx 10xxxxxx ... | $F0 to $F7 -> $F4
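
The bit layout in the table above can be checked with a small encoder sketch. This is an illustration only and not part of the module (which only decodes): the name "sketchutf8encode" is hypothetical, and the sketch assumes plain LUA arithmetic without bitwise operators, as elsewhere in this module.

```lua
-- Hypothetical helper: encode one codepoint into its UTF8 octets.
-- Returns a string, or nil for the banned range and out-of-range values.
local function sketchutf8encode (numcp)
  if ((numcp<0) or (numcp>1114111)) then
    return nil -- outside U+0000 ... U+10FFFF
  end
  if ((numcp>=55296) and (numcp<=57343)) then
    return nil -- banned range U+D800 ... U+DFFF
  end
  if (numcp<128) then
    return string.char (numcp) -- 0xxxxxxx
  end
  if (numcp<2048) then -- 110xxxxx 10xxxxxx
    return string.char (192 + math.floor (numcp/64), 128 + (numcp%64))
  end
  if (numcp<65536) then -- 1110xxxx 10xxxxxx 10xxxxxx
    return string.char (224 + math.floor (numcp/4096),
      128 + (math.floor (numcp/64) % 64), 128 + (numcp%64))
  end
  return string.char (240 + math.floor (numcp/262144), -- 11110xxx and 3 more
    128 + (math.floor (numcp/4096) % 64),
    128 + (math.floor (numcp/64) % 64), 128 + (numcp%64))
end

-- show the octets of U+20AC in the "$xx" notation used in this comment
local stroct = sketchutf8encode (8364)
print (string.format ("$%02X,$%02X,$%02X", string.byte (stroct,1,3)))
```

For codepoint $20AC the sketch prints "$E2,$82,$AC", matching the EURO row of the codepoint list.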
]===]
local exporttable = {}
------------------------------------------------------------------------
---- CONSTANTS [O] ----
------------------------------------------------------------------------
-- constant strings (error circumfixes)
local constrkros = ' # # ' -- lagom -> huge circumfix
local constrelabg = '<span class="error"><b>' -- lagom whining begin
local constrelaen = '</b></span>' -- lagom whining end
-- HTML stuff for our tiny table and background around every char
local constrtabu3 = '<table style="display:inline-block; vertical-align:middle; margin:0.15em; padding:0.15em; border:0.15em solid #000000; text-align:center; background-color:#' -- missing color code and many char:s (only 3 ';">' to close element)
local constrtabu4 = ';"><tr><td>'
local constrtabu5 = '</td></tr></table>'
local constrbkg3 = '<span style="font-size:160%;background-color:#E0A0FF;"> '
local constrbkg4 = ' </span>'
local constrpilen = '[[File:Return arrow.svg|20px|link=]]' -- the file is Public Domain
local contabempatwarna = {[0]='FFA0A0','D0FFD0','A0A0FF','D0D0D0'} -- red, light green, blue, light grey
local contabwar8na = {}
contabwar8na = {'FFFFFF','E8E8E8','D0D0D0','B8B8B8','FF6060','FFFF60','FFFFB0','C8C8FF'} -- (index 1...8)
-- constant strings EN vs EO vs ID vs SV
-- local constrkosong = 'empty string submitted' -- EN
local constrkosong = 'malplena signocxeno transdonita' -- EO
-- local constrkosong = 'string datang bersifat kosong' -- ID
-- local constrkosong = 'inkommen string aer tom' -- SV
-- local constrinvalid = 'invalid UTF8 value sequence' -- EN
local constrinvalid = 'nevalida sekvo de UTF8-valoroj' -- EO
-- local constrinvalid = 'rantai nilai UTF8 bersifat invalid' -- ID
-- local constrinvalid = 'ogiltig sekvens av UTF8-vaerden' -- SV
------------------------------------------------------------------------
---- MATH FUNCTIONS [E] ----
------------------------------------------------------------------------
-- Local function MATHDIV
local function mathdiv (xdividend, xdivisor)
local resultdiv = 0 -- LUA lacks a DIV operator :-(
resultdiv = math.floor (xdividend / xdivisor)
return resultdiv
end--function mathdiv
-- Local function MATHMOD
local function mathmod (xdividendo, xdivisoro)
local resultmod = 0 -- MOD operator is "%", a bitwise AND operator is lacking too
resultmod = xdividendo % xdivisoro
return resultmod
end--function mathmod
------------------------------------------------------------------------
-- Local function MATHXOR
-- Depends on functions :
-- [E] mathdiv mathmod
local function mathxor (xa, xb)
local resultxor = 0
local crap6 = 0
local crap7 = 0
local crap8 = 1 -- single bit value 1 -> 2 -> 4 -> 8 ...
while true do
if ((xa==0) and (xb==0)) then
break -- we have run out of bits on both
end--if
crap6 = mathmod (xa,2) -- pick bit before dividing
crap7 = mathmod (xb,2) -- pick bit before dividing
xa = mathdiv (xa,2) -- shift right
xb = mathdiv (xb,2) -- shift right
if (crap6~=crap7) then
resultxor = resultxor + crap8 -- add one bit rtl only if true
end--if
crap8 = crap8 * 2
end--while
return resultxor
end--function mathxor
------------------------------------------------------------------------
---- NUMBER CONVERSION FUNCTIONS [N] ----
------------------------------------------------------------------------
-- Local function LFDEC1DIGIT
-- Convert 1 decimal ASCII digit to integer 0...9 (255 if invalid)
local function lfdec1digit (num1digit)
num1digit = num1digit - 48 -- may become invalid
if ((num1digit<0) or (num1digit>9)) then
num1digit = 255
end--if
return num1digit
end--function lfdec1digit
------------------------------------------------------------------------
-- Local function LFNUINT8TOHEX
-- Convert UINT8 (0...255) to a 2-digit hex string.
-- Depends on functions :
-- [E] mathdiv mathmod
local function lfnuint8tohex (numinclow)
local strheksulo = ''
local numhajhaj = 0
numhajhaj = mathdiv (numinclow,16)
numinclow = mathmod (numinclow,16)
if (numhajhaj>9) then
numhajhaj = numhajhaj + 7 -- now 0...9 or 17...22
end--if
if (numinclow>9) then
numinclow = numinclow + 7 -- now 0...9 or 17...22
end--if
strheksulo = string.char (numhajhaj+48) .. string.char (numinclow+48)
return strheksulo
end--function lfnuint8tohex
------------------------------------------------------------------------
-- Local function LFUINT32TOHEX
-- Convert UINT32 (0 ... $FFFF'FFFF = #4'294'967'295) to
-- a (2 or 4 or 6 or 8)-digit hex string.
-- Depends on functions :
-- [N] lfnuint8tohex
-- [E] mathdiv mathmod
local function lfuint32tohex (numincom)
local strheksulego = ''
while true do
strheksulego = lfnuint8tohex ( mathmod (numincom,256) ) .. strheksulego
numincom = mathdiv (numincom,256)
if (numincom==0) then
break
end--if
end--while
return strheksulego
end--function lfuint32tohex
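--[==[
The two hex converters above can be cross-checked against "string.format".
The following sketch is not part of the original module: the names
"sketchuint8tohex" and "sketchuint32tohex" are hypothetical, and it assumes
that the values stay within what "string.format" handles on the platform
(codepoints up to $10'FFFF are unproblematic).

```lua
-- Hypothetical equivalents based on "string.format".
local function sketchuint8tohex (numin)
  return string.format ("%02X", numin) -- always 2 uppercase hex digits
end

local function sketchuint32tohex (numin)
  local strhex = string.format ("%X", numin)
  if ((string.len (strhex) % 2) == 1) then
    strhex = "0" .. strhex -- pad to an even number of digits
  end
  return strhex
end

print (sketchuint8tohex (208))    -- D0
print (sketchuint32tohex (1040))  -- 0410
print (sketchuint32tohex (66376)) -- 010348
```
]==]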
------------------------------------------------------------------------
---- LOW LEVEL STRING FUNCTIONS [G] ----
------------------------------------------------------------------------
-- test whether char is an ASCII digit "0"..."9", return boolean
local function lfgtestnum (numkaad)
local boodigit = false
boodigit = ((numkaad>=48) and (numkaad<=57))
return boodigit
end--function lfgtestnum
------------------------------------------------------------------------
-- test whether char is an ASCII uppercase letter, return boolean
local function lfgtestuc (numkode)
local booupperc = false
booupperc = ((numkode>=65) and (numkode<=90))
return booupperc
end--function lfgtestuc
------------------------------------------------------------------------
-- test whether char is an ASCII lowercase letter, return boolean
local function lfgtestlc (numcode)
local boolowerc = false
boolowerc = ((numcode>=97) and (numcode<=122))
return boolowerc
end--function lfgtestlc
------------------------------------------------------------------------
-- Local function LFGIS62SAFE
-- Test whether incoming ASCII char is very safe (0...9 A...Z a...z).
-- Depends on functions :
-- [G] lfgtestnum lfgtestuc lfgtestlc
local function lfgis62safe (numcxair)
local booguud = false
booguud = lfgtestnum (numcxair) or lfgtestuc (numcxair) or lfgtestlc (numcxair)
return booguud
end--function lfgis62safe
------------------------------------------------------------------------
---- SOME FUNCTIONS ---- !!!FIXME!!!
------------------------------------------------------------------------
-- Local function LFHEXDEC
-- Example output : "$FE=#254" (we have to save text width)
-- Depends on "lfnuint8tohex"
local function lfhexdec (numkodo)
local strrezulto = ''
strrezulto = "$" .. lfnuint8tohex (numkodo) .. "=#" .. tostring (numkodo)
return strrezulto
end--function lfhexdec
------------------------------------------------------------------------
-- Local function LFBUNCH !!!FIXME!!!
-- Add digit bunching to raw decimal number string
local function lfbunch (strnomorin)
local strnomorut = ""
local numlenn = 0
local numindeex = 0 -- ZERO-based counts up
local numcaar = 0 -- char of string
numlenn = string.len(strnomorin)
while true do
if (numindeex==numlenn) then
break
end--if
numcaar = string.byte(strnomorin,(numlenn-numindeex),(numlenn-numindeex))
if ((mathmod(numindeex,3)==0) and (numindeex~=0)) then
strnomorut = "'" .. strnomorut -- apo
end--if
strnomorut = string.char(numcaar) .. strnomorut
numindeex = numindeex + 1 -- index counts up but we go back
end--while
return strnomorut
end--function lfbunch
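--[==[
The digit bunching done by "lfbunch" can also be expressed with string
patterns. The sketch below is a hypothetical alternative and not the
module's code: the name "sketchbunch" is invented here, and it assumes that
the input consists of decimal digits only.

```lua
-- Hypothetical alternative: bunch digits in groups of 3 from the right,
-- using an apostrophe as separator.
local function sketchbunch (strdigits)
  local strrev = string.reverse (strdigits)
  strrev = string.gsub (strrev, "(%d%d%d)", "%1'") -- apo after every 3 digits
  local strout = string.reverse (strrev)
  strout = string.gsub (strout, "^'", "") -- drop apo left over at the front
  return strout
end

print (sketchbunch ("1040"))    -- 1'040
print (sketchbunch ("1114112")) -- 1'114'112
```
]==]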
------------------------------------------------------------------------
---- UTF8 FUNCTIONS [U] ----
------------------------------------------------------------------------
-- Local function LFULNUTF8CHAR
-- Evaluate length of a single UTF8 char in octet:s.
-- Input : * numbgoctet -- beginning octet of a UTF8 char
-- Output : * numlen1234x -- number 1...4 or ZERO if invalid
-- Does NOT thoroughly check the validity, looks at 1 octet only.
local function lfulnutf8char (numbgoctet)
local numlen1234x = 0
if (numbgoctet<128) then
numlen1234x = 1 -- $00...$7F -- ANSI/ASCII
end--if
if ((numbgoctet>=194) and (numbgoctet<=223)) then
numlen1234x = 2 -- $C2 to $DF
end--if
if ((numbgoctet>=224) and (numbgoctet<=239)) then
numlen1234x = 3 -- $E0 to $EF
end--if
if ((numbgoctet>=240) and (numbgoctet<=244)) then
numlen1234x = 4 -- $F0 to $F4
end--if
return numlen1234x
end--function lfulnutf8char
------------------------------------------------------------------------
-- Local function LFUTF8DEKO
-- Decode a single UTF8 char, return ZERO length if invalid.
-- Output : * "tabresult" -- LUA table [0] length and [1] codepoint
-- Depends on functions :
-- [E] mathdiv mathmod mathxor
local function lfutf8deko (num0, num1, num2, num3)
local tabresult = {}
local numlength = 0 -- preASSume invalid
local numkodepoin = 0 -- preASSume invalid
num1 = mathxor (num1,128) -- XOR 3 of 4
num2 = mathxor (num2,128) -- XOR 3 of 4
num3 = mathxor (num3,128) -- XOR 3 of 4
while true do -- fake loop
if ((num0>193) and (num1>63)) then
break -- to join mark
end--if
if ((num0>223) and (num2>63)) then
break -- to join mark
end--if
if ((num0>239) and (num3>63)) then
break -- to join mark
end--if
if (num0<128) then -- ZERO to $7F
numkodepoin = num0
numlength = 1
break -- to join mark
end--if
if ((num0>193) and (num0<224)) then -- $C0 # $C2 to $DF
numkodepoin = (mathxor(num0,192)) * 64 + num1
if ((numkodepoin>127) and (numkodepoin<2048)) then
numlength = 2
end--if
break -- to join mark
end--if
if ((num0>223) and (num0<240)) then -- $E0 to $EF
numkodepoin = (mathxor(num0,224)) * 4096 + num1 * 64 + num2
if (((numkodepoin>2047) and (numkodepoin<55296)) or ((numkodepoin>57343) and (numkodepoin<65536))) then
numlength = 3
end--if
break -- to join mark
end--if
if ((num0>239) and (num0<245)) then -- $F0 to $F7 # $F4
numkodepoin = (mathxor(num0,240)) * 262144 + num1 * 4096 + num2 * 64 + num3
if ((numkodepoin>65535) and (numkodepoin<1114112)) then
numlength = 4
end--if
break -- to join mark
end--if
break -- finally to join mark
end--while -- fake loop -- join mark
tabresult [0] = numlength
tabresult [1] = numkodepoin
return tabresult
end--function lfutf8deko
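--[==[
Worked example for the 2-octet branch of "lfutf8deko" (a sketch, not part
of the original module, the variable names are hypothetical). For a leading
octet in $C2...$DF the XOR with 192 equals a subtraction of 192, and for a
continuation octet in $80...$BF the XOR with 128 equals a subtraction of
128, so the arithmetic can be checked by hand with the Cyrillic letter
encoded as $D0,$90 in the documentation example:

```lua
local numlead = 208 -- $D0, leading octet of a 2-octet char
local numcont = 144 -- $90, continuation octet
-- strip the marker bits "110" and "10", then join 5 + 6 payload bits
local numcp = (numlead - 192) * 64 + (numcont - 128)
print (numcp) -- 1040 ie U+0410
```
]==]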
------------------------------------------------------------------------
---- HIGH LEVEL STRING FUNCTIONS [I] ----
------------------------------------------------------------------------
-- Local function LFIULTENCODE
-- Generously encode char:s to prevent parsing and show hex if needed, make
-- single chars visible, bypass all wiki parsing and HTML parsing. Our cool
-- module has brewed something with "[["..."]]" and repeated spaces but we
-- want to see plain text for debugging purposes. Thus we dec-encode some
-- char:s, use NBSP to fix spaces, workaround EOL, and maybe add colour.
-- Input : * strkrampuj : string, empty tolerable, but type "nil" is NOT
-- * nummxwidth : maximal width of text (20...200, default 80)
-- * boowarrna : "true" to enable color
-- * boosplitutf : "true" to split UTF8 char:s into hex numbers
-- Output : * strkood : string, empty in worst case
-- Depends on functions :
-- [U] lfulnutf8char
-- [G] lfgtestnum lfgtestuc lfgtestlc lfgis62safe
-- [N] lfnuint8tohex
-- [E] mathdiv mathmod
-- Depends on constants :
-- * string constrpilen [[File:...]]
-- * table contabempatwarna 0...3
-- This helps with:
-- * "[["..."]]", "["..."]", "*", "#", ":" (note that there is no
-- problem with plain "{{"..."}}")
-- * multiple spaces (they are no longer reduced to one piece due to HTML)
-- * EOL:s (they do not vanish in favor of spaces due to HTML, instead
-- the EOL arrow is shown)
-- * too long lines (they are force-broken)
-- * codes below 32 other than EOL
-- There is also "mw.text.nowiki" with some limitations, most notably
-- about multiple spaces and EOL:s.
-- In order to fix EOL we show the EOL arrow (preceded by space) for every
-- incoming LF, but do a "<br>" only once after multiple subsequent LF:s.
-- We must be UTF8-aware. A UTF8 char must be either split into hex codes,
-- or preserved over its complete length ie not split nor encoded at all.
-- Note that this causes BLOAT. The caller is responsible for
-- adding "<big>"..."</big>" if desired.
local function lfiultencode (strkrampuj,nummxwidth,boowarrna,boosplitutf)
local stronechar = ''
local strkolorr = ''
local strkood = ''
local numstrlne = 0
local numpeekynx = 1 -- ONE-based index
local numcahr = 0
local numcxxhr = 0
local numutf8len = 0
local numaccuwidth = 0 -- accumulated width
local numcolour = 0 -- 0,1,2,3 -- R,G,B,Y
local boonbsp = true -- "true" needed for junk lines containing only space
local boosplnow = false -- allow forced split in some cases
local boofickpilen = false -- true after LF arrow causes "<br>" later
if (type(nummxwidth)~='number') then
nummxwidth = 80
end--if
if ((nummxwidth<20) or (nummxwidth>200)) then
nummxwidth = 80
end--if
numstrlne = string.len (strkrampuj)
while true do -- outer genuine loop
if (numpeekynx>numstrlne) then
break
end--if
numcahr = string.byte (strkrampuj,numpeekynx,numpeekynx)
numpeekynx = numpeekynx + 1 -- ONE-based index
while true do -- inner fake loop
if (numcahr==10) then
break -- to join mark -- inner fake loop -- special processing for LF
end--if
if (numcahr==32) then
if (boonbsp) then
stronechar = '&nbsp;' -- this prevents space reduction
else
stronechar = ' '
end--if
boonbsp = not boonbsp
break -- to join mark -- inner fake loop
end--if
if (numcahr<32) then
stronechar = '{$' .. lfnuint8tohex (numcahr) .. '}' -- always as hex
break -- to join mark -- inner fake loop
end--if
if (numcahr>127) then
boosplnow = boosplitutf
numutf8len = lfulnutf8char (numcahr)
if (numutf8len==0) then
boosplnow = true -- forced split for broken UTF8 sequence
else
numutf8len = numutf8len - 1 -- more char:s to pick
end--if
if ((numpeekynx+numutf8len)>(numstrlne+1)) then
boosplnow = true -- forced split for truncated UTF8 sequence
end--if
if (boosplnow) then
stronechar = '{$' .. lfnuint8tohex (numcahr) .. '}'
else
stronechar = string.char (numcahr) -- preserve "numcahr" below
while true do -- deep loop copy UTF8 char
if (numutf8len==0) then
break
end--if
numcxxhr = string.byte (strkrampuj,numpeekynx,numpeekynx)
numpeekynx = numpeekynx + 1
numutf8len = numutf8len - 1
stronechar = stronechar .. string.char (numcxxhr)
end--while -- deep loop copy UTF8 char
end--if
break -- to join mark
end--if (numcahr>127) then
if (lfgis62safe(numcahr)) then -- safe ASCII ie 0...9 A...Z a...z
stronechar = string.char (numcahr) -- do NOT encode safe char:s
break -- to join mark
end--if
stronechar = '&#' .. tostring (numcahr) .. ';' -- dec-encode some ASCII
break -- finally to join mark
end--while -- inner fake loop -- join mark
if (numcahr==10) then
if (numaccuwidth>=nummxwidth) then
strkood = strkood .. '<br>'
numaccuwidth = 0
boonbsp = true -- "true" needed for junk lines containing only space
end--if
strkood = strkood .. ' ' .. constrpilen
numaccuwidth = numaccuwidth + 2 -- counts doubly
boofickpilen = true
else
if (boofickpilen or (numaccuwidth>=nummxwidth)) then
strkood = strkood .. '<br>'
numaccuwidth = 0
boonbsp = true -- "true" needed for junk lines containing only space
end--if
if (boowarrna) then
strkolorr = contabempatwarna [numcolour]
numcolour = mathmod ((numcolour+1),4) -- index 0...3
strkood = strkood .. '<span style="background-color:#' .. strkolorr .. ';">' .. stronechar .. '</span>'
else
strkood = strkood .. stronechar
end--if
numaccuwidth = numaccuwidth + 1
boofickpilen = false
end--if (numcahr==10) else
end--while -- outer genuine loop
return strkood
end--function lfiultencode
------------------------------------------------------------------------
-- Local function LFIVALIUMDCTLSTR
-- Validate control string against restrictive pattern (dec).
-- Input : * strresdpat -- restrictive pattern (max 200 char:s)
-- * strctldstr -- incoming suspect
-- Output : * numbadpos -- bad position, or 254 wrong length, or 255 success
-- Depends on functions :
-- [N] lfdec1digit
-- Content of restrictive pattern:
-- * "." -- skip check
-- * "-" and "?" -- must match literally
-- * digit "1"..."9" ("0" invalid) -- inclusive upper limit (min ZERO)
local function lfivaliumdctlstr (strresdpat, strctldstr)
local numlenresdpat = 0
local numldninkom = 0
local numcomperindex = 0 -- ZERO-based
local numead2 = 0
local numead3 = 0
local numbadpos = 254 -- preASSume guilt (len differ or too long or ...)
local booddaan = false
numlenresdpat = string.len(strresdpat)
numldninkom = string.len(strctldstr)
if ((numlenresdpat<=200) and (numlenresdpat==numldninkom)) then
while true do
if (numcomperindex==numlenresdpat) then
numbadpos = 255
break -- success
end--if
numead2 = string.byte(strresdpat,(numcomperindex+1),(numcomperindex+1)) -- rest
numead3 = string.byte(strctldstr,(numcomperindex+1),(numcomperindex+1)) -- susp
booddaan = false
if ((numead2==45) or (numead2==63)) then
if (numead2~=numead3) then
numbadpos = numcomperindex
break -- "-" and "?" must match literally
end--if
booddaan = true -- position OK
end--if
if (numead2==46) then -- skip for dot "."
booddaan = true -- position OK
end--if
if (not booddaan) then
numead2 = lfdec1digit(numead2) -- rest
if (numead2>9) then -- limit defined or bad ??
numbadpos = 254
break -- bad restrictive pattern
else
numead3 = lfdec1digit(numead3) -- susp
if (numead3>numead2) then
numbadpos = numcomperindex
break -- value limit violation
end--if
end--if (numead2>9) else
end--if (not booddaan) then
numcomperindex = numcomperindex + 1
end--while
end--if ((numlenresdpat<=200) and (numlenresdpat==numldninkom)) then
return numbadpos
end--function lfivaliumdctlstr
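--[==[
The return codes of "lfivaliumdctlstr" are easy to mix up, so here is a
simplified restatement. It is a sketch only and not the module's code: the
name "sketchvalidate" is hypothetical, and it assumes a restrictive pattern
made of digits only (no ".", "-" or "?"), such as the documented maximum
"1131" for "outctl=".

```lua
-- Hypothetical digits-only variant: return the ZERO-based bad position,
-- 254 on length mismatch, 255 on success.
local function sketchvalidate (strpat, strsus)
  if (string.len (strpat) ~= string.len (strsus)) then
    return 254
  end
  for numpos = 1, string.len (strpat) do
    local numlimit = string.byte (strpat,numpos) - 48 -- inclusive upper limit
    local numvalue = string.byte (strsus,numpos) - 48 -- suspect digit
    if ((numvalue<0) or (numvalue>numlimit)) then
      return numpos - 1 -- not a digit, or limit exceeded
    end
  end
  return 255
end

print (sketchvalidate ("1131","1101")) -- 255
print (sketchvalidate ("1131","1141")) -- 2
print (sketchvalidate ("1131","110"))  -- 254
```
]==]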
------------------------------------------------------------------------
---- HIGH LEVEL FUNCTIONS [H] ----
------------------------------------------------------------------------
-- Local function LFWARNA
-- Convert integer 1...8 (must be valid) to 6 digits hex color.
-- fill the gap between "constrtabu3" and "constrtabu4" always with help of
-- this sub, do NOT put hardcoded color values there
-- we use "contabwar8na"
-- 1 white default, 2...4 grey getting darker,
-- 5 red, 6 yellow, 7 light yellow, 8 light blue
local function lfwarna (indexofcolor)
local strfaerg = ''
strfaerg = contabwar8na [indexofcolor]
return strfaerg
end--function lfwarna
------------------------------------------------------------------------
-- Local function LFHIGATEYELLOW !!!FIXME!!! use table
-- Detect TAB CR NBSP ZWSP LRM RLM BOM -- "yellow class error"
-- ZERO is "red class error" and not included here
local function lfhigateyellow (numcodepoint)
local strnamev = ''
if (numcodepoint== 9) then
strnamev = 'TAB'
end--if
if (numcodepoint== 13) then
strnamev = 'CR'
end--if
if (numcodepoint== 160) then
strnamev = 'NBSP'
end--if
if (numcodepoint== 8203) then
strnamev = 'ZWSP'
end--if
if (numcodepoint== 8206) then
strnamev = 'LRM'
end--if
if (numcodepoint== 8207) then
strnamev = 'RLM'
end--if
if (numcodepoint==65279) then
strnamev = 'BOM'
end--if
return strnamev
end--function lfhigateyellow
------------------------------------------------------------------------
-- Local function LFIGATESPECIAL !!!FIXME!!! use table
-- Detect LF SPACE -- "light yellow class char"
local function lfigatespecial (numcoodepoint)
local strnme = ''
if (numcoodepoint== 10) then
strnme = 'LF'
end--if
if (numcoodepoint== 32) then
strnme = 'SPACE'
end--if
return strnme
end--function lfigatespecial
------------------------------------------------------------------------
---- VARIABLES [R] ----
------------------------------------------------------------------------
function exporttable.ek (arxframent)
-- general unknown type
local vartmp = 0 -- variable without type
-- special type "args" AKA "arx"
local arxsomons = 0 -- metaized "args" from our own or caller's "frame"
-- general "tab"
local tabutf8dec = {}
-- general "str"
local strinc = "" -- incoming text
local strctrl = "" -- from optional parameter
local strmytemp = ""
local strret = "" -- final output string
-- general "num"
local numlongtx = 0 -- length of incoming parameter
local numlung = 0 -- temp
local numwarna = 0
local numoct = 0 -- temp some char
local numodt = 0 -- temp some char
local numoet = 0 -- temp some char
local numoft = 0 -- temp some char
local numutflen = 0
local numchrlen = 0 -- number of UTF8 char:s
-- general "boo"
local boocrap = false
local boopendlf = false -- pending LF between sections
-- more "boo" from parameters
local booempsil = false
local boooktblo = false
local boobigbox = false -- show big boxes
local boohardnw = false -- "true" from "1" or "2" or "3"
local boohnwcol = false -- "true" from "2" or "3" only
local boohnwspt = false -- "true" from "3" only
local booutfblo = false -- show UTF8 char bloat
------------------------------------------------------------------------
---- MAIN [Z] ----
------------------------------------------------------------------------
---- GUARD AGAINST INTERNAL ERROR ----
-- "constrkosong" and "constrinvalid" must be uncommented and assigned
-- note that reporting of this error may NOT depend on uncommentable strings
boocrap = ((type(constrkosong)~="string") or (type(constrinvalid)~="string"))
---- GET THE ARX (ONE OF TWO) ----
if (not boocrap) then
arxsomons = arxframent.args -- "args" from our own "frame"
vartmp = arxsomons ["caller"]
if (vartmp=="true") then
arxsomons = arxframent:getParent().args -- "args" from caller's "frame"
end--if
end--if
---- CHECK ----
if (not boocrap) then
if (type(arxsomons[2])=="string") then
boocrap = true -- too many anonymous parameters, only one is allowed
end--if
end--if
---- SEIZE ONE ANONYMOUS AND OBLIGATORY PARAMETER ----
-- on success assign "strinc" and "numlongtx" (not to be touched later)
if (not boocrap) then
vartmp = arxsomons [1]
if (type(vartmp)=="string") then
numlongtx = string.len (vartmp)
if (numlongtx>65536) then
boocrap = true -- this causes bloat, we can never encode such big input
else
strinc = vartmp
end--if
end--if (type(vartmp)=="string") then
end--if
---- SEIZE AND CHECK BIG NAMED AND OPTIONAL PARAMETER ----
-- default is "1101", "0000" is prohibited, "nw" is synonymous
-- with "0010", empty main input later switches the type to "1000"
-- (or to "0000" if "empsil=1" was given)
if (not boocrap) then
do -- scope
local vartumip = 0
local numsilur = 0
strctrl = "1101" -- default
vartumip = arxsomons ["outctl"]
if (type(vartumip)=="string") then
if (vartumip=="nw") then -- alias
vartumip = "0010"
end--if
if (vartumip=="0000") then
boocrap = true
else
numsilur = lfivaliumdctlstr ('1131',vartumip)
if (numsilur==255) then
strctrl = vartumip
else
boocrap = true
end--if
end--if
end--if (type(vartumip)=="string") then
end--do scope
end--if (not boocrap) then
---- SEIZE AND CHECK BOOLEAN NAMED AND OPTIONAL PARAMETER ----
if (not boocrap) then
vartmp = arxsomons ["empsil"]
if (type(vartmp)=="string") then
if (vartmp=="1") then
booempsil = true
else
boocrap = true
end--if
end--if
end--if
---- EMPTINESS ----
if ((not boocrap) and (numlongtx==0)) then
if (booempsil) then
strctrl = "0000" -- empty main input switches type to silly "0000"
else
strctrl = "1000" -- empty main input switches type to "1000"
end--if
end--if
---- PROCESS CONTROL STRING ----
-- "string.byte" gives ASCII codes: 48 is "0", 49 is "1", 51 is "3"
if (not boocrap) then
numoft = string.byte(strctrl,1,1)
boooktblo = (numoft==49) -- octet bloat
numoft = string.byte(strctrl,2,2)
boobigbox = (numoft==49) -- big boxes
numoft = string.byte(strctrl,3,3) -- types of "hard nowiki"
boohardnw = (numoft~=48) -- "true" from "1" or "2" or "3"
boohnwcol = (numoft>49) -- "true" from "2" or "3" only
boohnwspt = (numoft==51) -- "true" from "3" only
numoft = string.byte(strctrl,4,4)
booutfblo = (numoft==49) -- UTF8 char bloat
end--if
---- WHINE IF YOU MUST ----
-- note that reporting of this error may NOT depend on uncommentable strings
if (boocrap) then
strmytemp = 'FATAL in "utf8debug" : internal error or invalid parameter'
strret = constrkros .. constrelabg .. strmytemp .. constrelaen .. constrkros
end--if
---- OCTET BLOAT ----
-- empty main input switches type to "1000" ie only "boooktblo" is true,
-- or "0000" (invalid from caller)
if ((not boocrap) and boooktblo) then
if (numlongtx==0) then
numwarna = 5 -- red on empty string (only 5 or 8 here)
strmytemp = constrkosong
else
numwarna = 8 -- light blue (only 5 or 8 here)
strmytemp = "number of<br>octet:s : " .. lfbunch (tostring (numlongtx) )
end--if
strret = constrtabu3 .. lfwarna (numwarna) .. constrtabu4 .. strmytemp .. constrtabu5
boopendlf = true -- the earliest one, "boopendlf" not assigned above
end--if
---- BIG BOXES ----
-- incoming "strinc" and "numlongtx"
-- we brew a private HTML table with just one cell for every single char
-- this is done for both boobigbox (use string output) and
-- booutfblo (discard string output, "numchrlen" is the big prey)
if ((not boocrap) and (boobigbox or booutfblo)) then
do -- scope
local strnamevil = '' -- name of a bad char, for example "CR" "ZWSP"
local strnamechr = '' -- name of special char, for example "LF" "SPACE"
local strsngchar = '' -- one char with "span" background
local strchrblok = '' -- prebrewed block with table for one char
local strbunch = ''
local numindx = 0 -- counts octet:s
local numreserv = 0
local numdecode = 0 -- decoded "codepoint" value
numchrlen = 0 -- counts UTF8 char:s, pass to below
while true do
if (numindx>=numlongtx) then
break
end--if
numreserv = numlongtx - numindx -- at least 1
numoct = string.byte (strinc,(numindx+1),(numindx+1))
numodt = 0
numoet = 0
numoft = 0
if (numreserv>=2) then
numodt = string.byte (strinc,(numindx+2),(numindx+2))
end--if
if (numreserv>=3) then
numoet = string.byte (strinc,(numindx+3),(numindx+3))
end--if
if (numreserv>=4) then
numoft = string.byte (strinc,(numindx+4),(numindx+4))
end--if
tabutf8dec = lfutf8deko (numoct,numodt,numoet,numoft)
numutflen = tabutf8dec [0]
numdecode = tabutf8dec [1]
strnamevil = '' -- presume NOT reporting any name -- yellow
strnamechr = '' -- presume NOT reporting any name -- light yellow
if (numutflen~=0) then
strnamevil = lfhigateyellow (numdecode) -- returns empty string if no hit
strnamechr = lfigatespecial (numdecode) -- returns empty string if no hit
end--if
numwarna = numutflen -- presume 1 to 4 by length, ZERO is invalid
if (strnamevil~='') then
numwarna = 6 -- yellow on TAB CR NBSP ZWSP LRM RLM BOM
end--if
if (strnamechr~='') then
numwarna = 7 -- light yellow on LF SPACE
end--if
if ((numoct==0) or (numutflen==0)) then
numwarna = 5 -- red on code ZERO or invalid sequence, must come last
if (numoct==0) then
strnamevil = "ZERO" -- assigned after the yellow test so ZERO stays red
end--if
end--if
strchrblok = constrtabu3 .. lfwarna (numwarna) .. constrtabu4 .. "<small>index</small> " .. lfbunch (tostring (numindx) )
strchrblok = strchrblok .. "<br><small>beg code</small> " .. lfhexdec (numoct)
if (numutflen==0) then
strchrblok = strchrblok .. "<br>" .. constrinvalid -- color already set above
else
strchrblok = strchrblok .. "<br><small>length</small> " .. tostring (numutflen)
strsngchar = string.char (numoct) -- maybe we will need it
if (numutflen>=2) then
strchrblok = strchrblok .. "<br><small>extra</small> $" .. lfnuint8tohex (numodt)
strsngchar = strsngchar .. string.char (numodt)
if (numutflen>=3) then
strchrblok = strchrblok .. ",$" .. lfnuint8tohex (numoet)
strsngchar = strsngchar .. string.char (numoet)
end--if
if (numutflen==4) then
strchrblok = strchrblok .. ",$" .. lfnuint8tohex (numoft)
strsngchar = strsngchar .. string.char (numoft)
end--if
strchrblok = strchrblok .. "<br><small>codepoint</small> U+$" .. lfuint32tohex (numdecode)
strchrblok = strchrblok .. "<br><small>dec</small> #" .. lfbunch (tostring (numdecode) )
end--if (numutflen>=2) then
if (strnamevil~='') then
strchrblok = strchrblok .. "<br>" .. strnamevil -- whine only if reason to
end--if
if (strnamechr~='') then
strchrblok = strchrblok .. "<br>" .. strnamechr -- boast only if reason to
end--if
if ((strnamevil..strnamechr)=='') then
strchrblok = strchrblok .. "<br>" .. constrbkg3 -- begin char background
if (numutflen==1) then
strchrblok = strchrblok .. "&#" .. tostring (numoct) .. ";" -- numeric char reference, "strsngchar" not needed
else
strchrblok = strchrblok .. strsngchar -- let wiki software & browser bother
end--if
strchrblok = strchrblok .. constrbkg4 -- close char background
end--if
end--if (numutflen==0) else
strchrblok = strchrblok .. constrtabu5 -- close table
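if (numutflen==0) then
numindx = numindx + 1 -- skip one octet on invalid sequence, else endless loop
else
numindx = numindx + numutflen -- ZERO-based index
end--if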
numchrlen = numchrlen + 1 -- invalid char:s do count too, the big prey
if (boobigbox) then
strbunch = strbunch .. strchrblok -- use or discard
end--if
end--while
if (boobigbox) then
if (boopendlf) then
strret = strret .. "<br>"
end--if
strret = strret .. strbunch
boopendlf = true
end--if
end--do scope
end--if ((not boocrap) and (boobigbox or booutfblo)) then
---- HARD NOWIKI ----
-- incoming "strinc" and "numlongtx"
-- boohardnw "true" from "1" or "2" or "3" -- do "hard nowiki"
-- boohnwcol "true" from "2" or "3" only -- requested colour
-- boohnwspt "true" from "3" only
-- restrict the width to 100 char:s (HTML parser breaks
-- on spaces only, we break at 100)
if ((not boocrap) and boohardnw) then
if (boopendlf) then
strret = strret .. "<br>"
end--if
strret = strret .. "<big>" .. lfiultencode (strinc,100,boohnwcol,boohnwspt) .. "</big>"
boopendlf = true
end--if
---- UTF8 BLOAT ----
-- incoming "numchrlen" cannot be ZERO if "booutfblo" is "true"
if ((not boocrap) and booutfblo) then
if (boopendlf) then
strret = strret .. "<br>" -- the last one, "boopendlf" not needed below
end--if
strmytemp = "number of UTF8<br>char:s : " .. lfbunch (tostring (numchrlen) )
strret = strret .. constrtabu3 .. lfwarna (8) .. constrtabu4 .. strmytemp .. constrtabu5
end--if
---- RETURN THE JUNK STRING ----
return strret
end--function
---- RETURN THE JUNK LUA TABLE ----
return exporttable