[DPRG] EmbedCV

Subject: [DPRG] EmbedCV
From: Clem Taylor clem.taylor at gmail.com
Date: Tue Oct 17 01:24:07 CDT 2006

On 10/16/06, Chris Jang <cjang at ix.netcom.com> wrote:
> My background is as a software and math person, so electronics is a weak
> point. Many of the people on the DPRG email list have either a background
> in electronics design engineering or deep experience in this area. I have
> some questions about C/C++ targeting DSPs and the practicalities involved.

So I'm somewhere in the middle: I'm a software guy, but I've only
worked for hardware companies. Right now, I'm doing computer vision
work for an intelligent camera company, and before that I worked for a
fabless semiconductor company that designed SIMD DSPs for digital
cameras.

I've been up against the 'super flexible C++ code' vs. 'application
specific C code' debate a number of times. The problem is that in most
of the designs I've worked on, there is just barely enough theoretical
compute power to implement the initial design, and by the time the
hardware is designed and the software is fleshed out, there are all
kinds of additional requirements... Suddenly the barely enough
compute becomes not nearly enough. You are always attempting to
squeeze every last pixel out of the DSPs.

So writing code for performance becomes very critical. You can write
heavily templated code with all kinds of nifty C++isms. You'll end up
with nice flexible code, but it just won't run at the same speed as
clean C code. Some of the problem can be blamed on the compiler
technology not being up to snuff, but mostly it is an issue of
overgeneralization on the programmer's side.

The common thing I've seen way too many times is code that has an
image class with get/put pixel routines which tends to result in code
that will do a function call, an address calculation
(pointer=base+stride*y+x) and then a read/write. Compare that to the
typical C code that just walks a pointer down a line of pixels.
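
As a rough sketch of the two styles, assuming a hypothetical Image
class (the class, its members and both invert functions are made up
for illustration and are not from EmbedCV or any real library):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical image class, for illustration only.
class Image {
public:
    Image(std::size_t width, std::size_t height)
        : width_(width), height_(height), stride_(width),
          data_(width * height, 0) {}

    // Every access recomputes base + stride*y + x.
    std::uint8_t get(std::size_t x, std::size_t y) const {
        return data_[stride_ * y + x];
    }
    void put(std::size_t x, std::size_t y, std::uint8_t v) {
        data_[stride_ * y + x] = v;
    }

    std::size_t width() const { return width_; }
    std::size_t height() const { return height_; }
    // Address of the first pixel in row y.
    std::uint8_t* row(std::size_t y) { return data_.data() + stride_ * y; }

private:
    std::size_t width_, height_, stride_;
    std::vector<std::uint8_t> data_;
};

// Generic style: one get and one put per pixel, each with its own
// address calculation (and a call, if the compiler does not inline).
void invert_getput(Image& img) {
    for (std::size_t y = 0; y < img.height(); ++y)
        for (std::size_t x = 0; x < img.width(); ++x)
            img.put(x, y, static_cast<std::uint8_t>(255 - img.get(x, y)));
}

// C style: compute the row address once, then walk a pointer down
// the line of pixels.
void invert_walk(Image& img) {
    for (std::size_t y = 0; y < img.height(); ++y) {
        std::uint8_t* p = img.row(y);
        std::uint8_t* const end = p + img.width();
        for (; p != end; ++p)
            *p = static_cast<std::uint8_t>(255 - *p);
    }
}
```

Both functions produce the same image. A good optimizer may turn the
first into the second, but on a weak DSP compiler that is not
something to count on.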

> I'm writing an open source C++ template library for computer vision. It is
> named EmbedCV as in "Embedded Computer Vision". It's not rocket science
> stuff but captures the basics in an efficient implementation.

Cool, what types of problems are you trying to solve with this
library? Object tracking? Mapping? Path planning?

> 1. floating point neutrality - can do everything with unsigned integers
>     PRO: good for CPUs with crippled floating point or DSPs
>     CON: accuracy can suffer

So you don't need to throw away all the floating point code. The trick
is to know where floating point makes sense and where it really hurts
performance. Some things can be overly painful in fixed point.
Typically all you really need to do is keep the floating point out of
the per-pixel loops and you are fine.
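
One way this can look, as a minimal sketch: the caller passes a
floating point gain, it is converted to fixed point once, and the
per-pixel loop is integer only. The function name scale_line and the
choice of 8 fractional bits (Q8) are assumptions made for this
example.

```cpp
#include <cstddef>
#include <cstdint>

// Scale a line of 8-bit pixels by a floating point gain.
// Assumes 0 <= gain < 256 so the Q8 product fits in 32 bits.
void scale_line(std::uint8_t* line, std::size_t count, float gain) {
    // The only floating point work: convert the gain to Q8 fixed
    // point (8 fractional bits), rounding to nearest. Done once.
    const std::uint32_t gain_q8 =
        static_cast<std::uint32_t>(gain * 256.0f + 0.5f);

    for (std::size_t i = 0; i < count; ++i) {
        // Integer multiply, add half an LSB for rounding, shift back.
        const std::uint32_t v = (line[i] * gain_q8 + 128u) >> 8;
        // Saturate to the 8-bit range instead of wrapping around.
        line[i] = static_cast<std::uint8_t>(v > 255u ? 255u : v);
    }
}
```

With a gain of 1.5, a pixel value of 100 comes out as 150 and a value
of 200 saturates at 255.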

> 2. highly templatized - allow compiler to optimize as much as possible
>     PRO: fast, especially with loop unrolling
>     CON: less flexible for programming, a little harder to use

So the code generation of C++ from various compilers is all over the
map. Templating tends to be one of those places that will really show
off how bad a bad compiler is. :-)

> Perhaps floating point in hardware is now ubiquitous? So optimization for
> integer arithmetic is superfluous and backwards looking?

Floating point is ubiquitous in the general purpose processor market,
but it very much isn't in the video world. The typical ARM processor
in your cellphone or MIPS processor in your 802.11 router isn't going
to have a floating point unit. What little floating point is needed is
just emulated in software.

The typical DSP used for embedded video applications isn't going to
have a floating point unit either. Most video/image processing
oriented DSPs are fixed point. They have instruction sets tuned for
doing fixed point operations. Most of this class of processor is
trying to get the most out of the available power budget. It is much
easier to make a fast integer unit than a fast FPU, and that integer
unit will consume less power. Also, in the silicon area of a single
floating point ALU you can fit many integer ALUs. So I don't think
fixed point processors are going anywhere.

> Also, what are the practicalities of multi-core DSPs? Are these in common use?

Multi-core DSPs are starting to become more and more common and that
trend is only going to continue. You can only scale the number of
functional units in a VLIW machine or the number of datapaths in a
SIMD machine so far. For some of the embarrassingly parallel image
processing problems you can take this scaling to the limit. I worked
on a SIMD processor that had one datapath per pixel column and would
operate on multiple lines of an image at a time. That machine would
plow through pixels, but it would often spend more time setting up the
loops than it would operating on an image worth of pixels. For the
more general processor, once you scale to a certain point, it is just
easier (for the hardware folks) to throw down more cores than it is to
increase the serial performance of the processor. Just ask Intel and
AMD...

The one big problem with multi-core DSPs is getting enough memory
bandwidth. The current system I'm working on has 4 720MHz DSPs, each
with a 64bit SDRAM interface. Many of our algorithms are memory bound,
not compute bound. So if they were to build a quad core DSP, it would
need to have 4x the memory bandwidth to compete with what we are
currently doing. However, most of the multi-core designs are not able
to scale the off chip memory bandwidth as much (or as easily) as they
can scale the number of cores. Intel and AMD have been mitigating this
by spending lots of area on caches. This helps with general purpose
processing, but doesn't help as much with streaming workloads
[assuming the cache is smaller than the image (stream) size].

> If so, then does that imply multi-threadable library implementation
> to balance across the cores? Or does a compiler do this work for you? I
> have no experience in this area.

Don't hold your breath for the compiler to come to your rescue. It is
almost always up to the programmer to partition the algorithms on the
available cores. How you do this depends entirely on the problem you
are trying to parallelize...
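
For the embarrassingly parallel per-pixel case, one simple manual
partitioning is to hand each core a contiguous band of rows. A
minimal sketch, assuming hypothetical names (RowBand, split_rows);
how each band is actually dispatched to a core depends on the DSP and
its toolchain and is left out here:

```cpp
#include <cstddef>
#include <vector>

// A contiguous band of image rows: [first, last).
struct RowBand {
    std::size_t first;
    std::size_t last;
};

// Split `height` rows into `cores` bands whose sizes differ by at
// most one row. Returns an empty vector if cores is zero.
std::vector<RowBand> split_rows(std::size_t height, std::size_t cores) {
    std::vector<RowBand> bands;
    if (cores == 0) return bands;

    const std::size_t base = height / cores;   // rows every band gets
    const std::size_t extra = height % cores;  // leftover rows
    std::size_t start = 0;
    for (std::size_t c = 0; c < cores; ++c) {
        // The first `extra` bands take one leftover row each.
        const std::size_t rows = base + (c < extra ? 1 : 0);
        bands.push_back({start, start + rows});
        start += rows;
    }
    return bands;
}
```

This only fits operations where each output pixel depends on its own
input pixel. A neighborhood filter would need overlapping rows at the
band edges, and other algorithms may not split by rows at all.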

                                        --Clem

Copyright © 1984 - 2006 Dallas Personal Robotics Group. All rights reserved.