1 {-# LANGUAGE RankNTypes, BangPatterns #-}
3 -- | This module uses the stream decoding functions from
4 -- <http://hackage.haskell.org/package/streaming-commons streaming-commons>
5 -- package to define decoding functions and lenses. The exported names
6 -- conflict with names in @Data.Text.Encoding@ but not with the @Prelude@
8 module Pipes.Text.Encoding
10 -- * Decoding ByteStrings and Encoding Texts
18 -- * Basic lens operations
29 -- * Non-lens decoding functions
37 -- * Re-encoding functions
44 -- * Functions for latin and ascii text
53 import Data.Functor.Constant (Constant(..))
54 import Data.Char (ord)
55 import Data.ByteString as B
56 import Data.ByteString.Char8 as B8
57 import Data.Text (Text)
58 import qualified Data.Text as T
59 import qualified Data.Text.Encoding as TE
60 import qualified Data.Streaming.Text as Stream
61 import Data.Streaming.Text (DecodeResult(..))
62 import Control.Monad (join, liftM)
70 > text :: Producer Text IO ()
72 we can encode it with @Data.Text.Encoding@ and ordinary pipe operations:
74 > text >-> P.map TE.encodeUtf8 :: Producer.ByteString IO ()
76 or, using this module, with
78 > for text encodeUtf8 :: Producer.ByteString IO ()
82 > bytes :: Producer ByteString Text IO ()
84 we can apply a decoding function from this module:
86 > decodeUtf8 bytes :: Producer Text IO (Producer ByteString IO ())
88 The Text producer ends wherever decoding first fails. Thus we can re-encode
89 as uft8 as much of our byte stream as is decodeUtf16BE decodable, with, e.g.
91 > for (decodeUtf16BE bytes) encodeUtf8 :: Producer ByteString IO (Producer ByteString IO ())
93 The bytestring producer that is returned begins with where utf16BE decoding
94 failed; if it didn't fail the producer is empty.
99 We get a bit more flexibility, though, if we use a lens like @utf8@ or @utf16BE@
100 that looks for text in an appropriately encoded byte stream.
102 > type Lens' a b = forall f . Functor f => (b -> f b) -> (a -> f a)
104 is just an alias for a Prelude type. We abbreviate this further, for our use case, as
107 > = forall m r . Monad m => Lens' (Producer ByteString m r) (Producer Text m (Producer ByteString m r))
109 and call the decoding lenses @utf8@, @utf16BE@ \"codecs\", since they can
110 re-encode what they have decoded. Thus you use any particular codec with
111 the @view@ / @(^.)@ , @zoom@ and @over@ functions from the standard lens libraries;
112 we presuppose neither <http://hackage.haskell.org/package/lens lens>
113 nor <http://hackage.haskell.org/package/lens-family lens-family>
114 since we already have access to the types they require.
116 Each decoding lens looks into a byte stream that is supposed to contain text.
117 The particular lenses are named in accordance with the expected
118 encoding, 'utf8', 'utf16LE' etc. To turn a such a lens or @Codec@
119 into an ordinary function, use @view@ / @(^.)@ -- here also called 'decode':
121 > view utf8 :: Producer ByteString m r -> Producer Text m (Producer ByteString m r)
122 > decode utf8 Byte.stdin :: Producer Text IO (Producer ByteString IO r)
123 > Bytes.stdin ^. utf8 :: Producer Text IO (Producer ByteString IO r)
125 These simple uses of a codec with @view@ or @(^.)@ or 'decode' can always be replaced by
126 the specialized decoding functions exported here, e.g.
128 > decodeUtf8 :: Producer ByteString m r -> Producer Text m (Producer ByteString m r)
129 > decodeUtf8 Byte.stdin :: Producer Text IO (Producer ByteString IO r)
131 As with these functions, the stream of text that a @Codec@ \'sees\'
132 in the stream of bytes begins at its head.
133 At any point of decoding failure, the stream of text ends and reverts to (returns)
134 the original byte stream. Thus if the first bytes are already
135 un-decodable, the whole ByteString producer will be returned, i.e.
137 > view utf8 bad_bytestream
139 will just come to the same as
141 > return bad_bytestream
143 Where there is no decoding failure, the return value of the text stream will be
144 an empty byte stream followed by its own return value. In all cases you must
145 deal with the fact that it is a /ByteString producer/ that is returned, even if
146 it can be thrown away with @Control.Monad.void@
148 > void (Bytes.stdin ^. utf8) :: Producer Text IO ()
150 The @eof@ lens permits you to pattern match: if there is a Right value,
151 it is the leftover bytestring producer, if there is a Right value, it
152 is the return value of the original bytestring producer:
154 > Bytes.stdin ^. utf8 . eof :: Producer Text IO (Either (Producer ByteString IO IO) ())
156 Thus for the stream of un-decodable bytes mentioned above,
158 > view (utf8 . eof) bad_bytestream
162 > return (Left bad_bytestream)
164 @zoom@ converts a Text parser into a ByteString parser:
166 > zoom utf8 drawChar :: Monad m => StateT (Producer ByteString m r) m (Maybe Char)
168 or, using the type synonymn from @Pipes.Parse@:
170 > zoom utf8 drawChar :: Monad m => Parser ByteString m (Maybe Char)
172 Thus we can define a ByteString parser (in the pipes-parse sense) like this:
174 > charPlusByte :: Parser ByteString m (Maybe Char, Maybe Word8)))
175 > charPlusByte = do char_ <- zoom utf8 Text.drawChar
176 > byte_ <- Bytes.peekByte
177 > return (char_, byte_)
179 Though @charPlusByte@ is partly defined with a Text parser 'drawChar';
180 but it is a ByteString parser; it will return the first valid utf8-encoded
181 Char in a ByteString, whatever its byte-length,
182 and the first byte following, if both exist. Because
183 we \'draw\' one and \'peek\' at the other, the parser as a whole only
184 advances one Char's length along the bytestring, whatever that length may be.
185 See the slightly more complex example \'decode.hs\' in the
186 <http://www.haskellforall.com/2014/02/pipes-parse-30-lens-based-parsing.html#batteries-included haskellforall blog>
187 discussion of this type of byte stream parsing.
190 type Lens' a b = forall f . Functor f => (b -> f b) -> (a -> f a)
195 => Lens' (Producer ByteString m r)
196 (Producer Text m (Producer ByteString m r))
199 {- | @decode@ is just the ordinary @view@ or @(^.)@ of the lens libraries;
200 exported here under a name appropriate to the material. Thus
202 > decode utf8 bytes :: Producer Text IO (Producer ByteString IO ())
204 All of these are thus the same:
206 > decode utf8 bytes = view utf8 bytes = bytes ^. utf8 = decodeUtf8 bytes
211 decode :: ((b -> Constant b b) -> (a -> Constant b a)) -> a -> b
212 decode codec a = getConstant (codec Constant a)
214 {- | @eof@ tells you explicitly when decoding stops due to bad bytes or
215 instead reaches end-of-file happily. (Without it one just makes an explicit
216 test for emptiness of the resulting bytestring production using next) Thus
218 > decode (utf8 . eof) bytes :: Producer T.Text IO (Either (Producer B.ByteString IO ()) ())
220 If we hit undecodable bytes, the remaining bytestring producer will be
221 returned as a Left value; in the happy case, a Right value is returned
222 with the anticipated return value for the original bytestring producer.
224 Again, all of these are the same
226 > decode (utf8 . eof) bytes = view (utf8 . eof) p = p^.utf8.eof
230 eof :: Monad m => Lens' (Producer Text m (Producer ByteString m r))
231 (Producer Text m (Either (Producer ByteString m r) r))
232 eof k p0 = fmap fromEither (k (toEither p0)) where
234 fromEither = liftM (either id return)
236 toEither pp = do p <- pp
239 check p = do e <- lift (next p)
241 Left r -> return (Right r)
242 Right (bs,pb) -> if B.null bs
244 else return (Left (do yield bs
248 utf8 = mkCodec decodeUtf8 TE.encodeUtf8
251 utf8Pure = mkCodec decodeUtf8Pure TE.encodeUtf8
254 utf16LE = mkCodec decodeUtf16LE TE.encodeUtf16LE
257 utf16BE = mkCodec decodeUtf16BE TE.encodeUtf16BE
260 utf32LE = mkCodec decodeUtf32LE TE.encodeUtf32LE
263 utf32BE = mkCodec decodeUtf32BE TE.encodeUtf32BE
265 decodeStream :: Monad m
266 => (B.ByteString -> DecodeResult)
267 -> Producer ByteString m r -> Producer Text m (Producer ByteString m r)
268 decodeStream = loop where
270 do x <- lift (next p)
272 Left r -> return (return r)
273 Right (chunk, p') -> case dec0 chunk of
274 DecodeResultSuccess text dec -> do yield text
276 DecodeResultFailure text bs -> do yield text
279 {-# INLINABLE decodeStream#-}
283 These are functions with the simple type:
285 > decodeUtf8 :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
289 > decodeUtf8 = view utf8
290 > decodeUtf16LE = view utf16LE
292 and so forth, but these forms
293 may be more convenient (and give better type errors!) where lenses are
298 decodeUtf8 :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
299 decodeUtf8 = decodeStream Stream.decodeUtf8
300 {-# INLINE decodeUtf8 #-}
302 decodeUtf8Pure :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
303 decodeUtf8Pure = decodeStream Stream.decodeUtf8Pure
304 {-# INLINE decodeUtf8Pure #-}
306 decodeUtf16LE :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
307 decodeUtf16LE = decodeStream Stream.decodeUtf16LE
308 {-# INLINE decodeUtf16LE #-}
310 decodeUtf16BE :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
311 decodeUtf16BE = decodeStream Stream.decodeUtf16BE
312 {-# INLINE decodeUtf16BE #-}
314 decodeUtf32LE :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
315 decodeUtf32LE = decodeStream Stream.decodeUtf32LE
316 {-# INLINE decodeUtf32LE #-}
318 decodeUtf32BE :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
319 decodeUtf32BE = decodeStream Stream.decodeUtf32BE
320 {-# INLINE decodeUtf32BE #-}
324 These are simply defined
326 > encodeUtf8 = yield . TE.encodeUtf8
328 They are intended for use with 'for'
330 > for Text.stdin encodeUtf8 :: Producer ByteString IO ()
332 which would have the effect of
334 > Text.stdin >-> Pipes.Prelude.map (TE.encodeUtf8)
336 using the encoding functions from Data.Text.Encoding
339 encodeUtf8 :: Monad m => Text -> Producer' ByteString m ()
340 encodeUtf8 = yield . TE.encodeUtf8
341 encodeUtf16LE :: Monad m => Text -> Producer' ByteString m ()
342 encodeUtf16LE = yield . TE.encodeUtf16LE
343 encodeUtf16BE :: Monad m => Text -> Producer' ByteString m ()
344 encodeUtf16BE = yield . TE.encodeUtf16BE
345 encodeUtf32LE :: Monad m => Text -> Producer' ByteString m ()
346 encodeUtf32LE = yield . TE.encodeUtf32LE
347 encodeUtf32BE :: Monad m => Text -> Producer' ByteString m ()
348 encodeUtf32BE = yield . TE.encodeUtf32BE
350 mkCodec :: (forall r m . Monad m =>
351 Producer ByteString m r -> Producer Text m (Producer ByteString m r ))
352 -> (Text -> ByteString)
354 mkCodec dec enc = \k p0 -> fmap (\p -> join (for p (yield . enc))) (k (dec p0))
359 ascii and latin encodings only use a small number of the characters 'Text'
360 recognizes; thus we cannot use the pipes @Lens@ style to work with them.
361 Rather we simply define functions each way.
365 -- | 'encodeAscii' reduces as much of your stream of 'Text' actually is ascii to a byte stream,
366 -- returning the rest of the 'Text' at the first non-ascii 'Char'
368 encodeAscii :: Monad m => Producer Text m r -> Producer ByteString m (Producer Text m r)
369 encodeAscii = go where
370 go p = do e <- lift (next p)
372 Left r -> return (return r)
376 else let (safe, unsafe) = T.span (\c -> ord c <= 0x7F) chunk
377 in do yield (B8.pack (T.unpack safe))
380 else return $ do yield unsafe
383 {- | Reduce as much of your stream of 'Text' actually is iso8859 or latin1 to a byte stream,
384 returning the rest of the 'Text' upon hitting any non-latin 'Char'
386 encodeIso8859_1 :: Monad m => Producer Text m r -> Producer ByteString m (Producer Text m r)
387 encodeIso8859_1 = go where
388 go p = do e <- lift (next p)
390 Left r -> return (return r)
394 else let (safe, unsafe) = T.span (\c -> ord c <= 0xFF) txt
395 in do yield (B8.pack (T.unpack safe))
398 else return $ do yield unsafe
401 {- | Reduce a byte stream to a corresponding stream of ascii chars, returning the
402 unused 'ByteString' upon hitting an un-ascii byte.
404 decodeAscii :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
405 decodeAscii = go where
406 go p = do e <- lift (next p)
408 Left r -> return (return r)
412 else let (safe, unsafe) = B.span (<= 0x7F) chunk
413 in do yield (T.pack (B8.unpack safe))
416 else return (do yield unsafe
419 {- | Reduce a byte stream to a corresponding stream of ascii chars, returning the
420 unused 'ByteString' upon hitting the rare un-latinizable byte.
422 decodeIso8859_1 :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)
423 decodeIso8859_1 = go where
424 go p = do e <- lift (next p)
426 Left r -> return (return r)
430 else do let (safe, unsafe) = B.span (<= 0xFF) chunk
431 yield (T.pack (B8.unpack safe))
434 else return (do yield unsafe