Programming parallel computing

El_Machinae
http://www.physorg.com/news99067728.html

QUOTE
Despite the promise of almost unimagined computing power, however, even computing experts wonder whether this time the hardware developers have raced too far ahead of many programmers' ability to create software.


Clearly we're going to have to invent a new type of programming educational system. Program designers will have to think differently than program designers do nowadays.

I wonder what math subjects would be useful to know when trying to learn (or design) a way of programming for multiple cores? What new things should computer scientists and innovators be teaching themselves, in order for this field to progress?

It looks like 3D computing will be the future of computers. How will we make best use of them?
Guest_StevenA
Yes, to take advantage of faster computer architectures, programmers are going to need to work with different programming models that include physical considerations. In some ways this is already present in programming: for example, hard drives aren't treated identically to RAM, and video bandwidth is generally recognized to be limited. But to make large gains, software algorithms need to be redesigned from the bottom up to take advantage of large-scale networks of simple computational elements (much like structures in the brain).

Anyone interested in this should google "Reconfigurable computing" or the closely related "Field Programmable Gate Arrays" (FPGAs). You can literally "program" electronic circuits to operate hundreds or in some cases even thousands of times faster than a typical general purpose processor, if you restrict yourself to working with large-scale networks of simple binary computations (and this thankfully includes a large range of multimedia and artificial intelligence applications). The field is still in its infancy and there should be plenty of growth in the future for software development in this area.
yor_on
What you two wrote sounded very interesting, El_Machinae and StevenA. So let's look at what the hardware can deliver to us at the same time. As you look at your screen you might have sound on, right? That makes two 'streams': the input you get from your eyes and the input you get from your ears. Are there any more inputs? I think those two are the main ones for a normal 'user'. So we need cool fast hardware for dealing with sound, and we need a cool parallel solution for dealing with 'pictures', am I right? As they are output in totally different modes, they should be quite easy to develop, I believe. And that would make your computer a true parallel machine. I'm not sure about this, but I guess that those dual and quad core CPUs only have one way out onto the motherboard? Or do they have multiple ways out hardwired? If they do, they have the possibility of being what I consider truly parallel motherboards; otherwise they are still giving out bit by bit.

Eh, btw it would still be bit by bit my way :) Sorry about that, but we would have two bit streams computed simultaneously. That was the point I was trying to make...
El_Machinae
I think that's a very good question, and should be fairly easy to explain for someone who knows more about computers than I do. I was wondering something similar the other day.

I would very much like someone knowledgeable in this field to recommend types of mathematics, though (for a person to study if they think they're going to get into this type of programming).
yor_on
If you're thinking of programming the hardware, I think StevenA gave the directions. If you are wondering how to create simultaneous 'threads' that make up a program, I think you should check out Linux programming. It's basically free, and they are developing nice GUIs too. Or do you mean an algorithm for how to execute parallel threads simultaneously? That's pure math, I think, and once you've developed that, it's time for the hardware industry to build to your specifications :)

BTW, StevenA's idea is a very fast way of creating programs too, and as I see it a different approach from 'threading', so check it out.
StevenA
The architectures I'm thinking of are very "fine grained". You wouldn't have only dual or quad processors but effectively thousands of small processors and processes operating in parallel. This isn't done easily for most programs (some algorithms don't have that amount of parallelism available, but most of the multimedia ones that we need a lot more computational power for work well in a highly parallel architecture).

You can not only improve computational power but simultaneously reduce power requirements. The typical power consumption of a digital switching circuit increases roughly in proportion to the square of the maximum clock frequency (it's linear with clock frequency once the circuit is developed, but higher-speed processes increase the power requirements per clock, so it ends up being a square relationship).

So if you took a processor running at 1,000 MHz and broke it up into 100 processors running at 10 MHz (that's wishful thinking, in that it's rarely 100% efficient in this manner, but ignoring that ...), then each of the new processors can run at approximately (1/100)^2, or 0.01%, of the original power consumption, and all 100 together end up using only 1% of the power that performing all the operations in a single processor would. That's how the brain operates as well. It uses low voltages and slow switching speeds but in a massively parallel fashion to give the best of both worlds (for limited applications ... we can't mentally do a single serial mathematical computation nearly as fast as a silicon processor, but we can scan a crowd for a face pretty efficiently, especially in terms of power consumption).
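
The arithmetic in that example can be checked with a few lines of Python. This is only a sketch: the square-law exponent is the assumption stated in the post, the split is taken to be perfectly efficient, and the function and variable names are made up for illustration.

```python
def relative_power(freq_ratio, exponent=2.0):
    """Power of one slow core relative to the original fast core,
    assuming power scales as frequency ** exponent (the post's square law)."""
    return freq_ratio ** exponent

original_mhz = 1000
cores = 100
core_mhz = original_mhz / cores                      # 10 MHz per core

per_core = relative_power(core_mhz / original_mhz)   # (1/100) ** 2
total = cores * per_core                             # all 100 cores together

print(f"each core uses {per_core:.4%} of the original power")
print(f"all {cores} cores use {total:.2%} of the original power")
```

Changing `exponent` shows how sensitive the saving is to the assumed power law; with a purely linear law the split would save nothing.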

I actually wrote a compiler that takes programs written in a reduced set of C instructions and compiles them into an electronic circuit that executes the program. (It works very well for small programs but doesn't handle memory structures or large arithmetic operations well. It does optimize smaller circuits better than many hand designs would, though ... my cousin has worked on similar designs.)

Anyway, there's literally a gold mine waiting for people who can find ways to efficiently convert typical serial software code into highly parallel electronic circuit equivalents (not all programs are amenable to this, but many are). It's still a growing industry, and Intel actually uses technology similar to this inside its current generation of processors, translating higher-level serial processor instructions into lower-level, finer-grained, parallel processes, but there are inefficiencies both in the memory bandwidth and in the overhead of the translation. A great architecture would be to integrate memory and processors into a single package. Then when you "upgraded" your memory, you'd be upgrading your processing power as well, and each processor could have a very high memory bandwidth by being directly integrated with the memory.

One of my favorite, inexpensive, yet potentially very powerful family of parts is the Spartan 3 series from Xilinx. http://www.xilinx.com/products/silicon_sol...eries/index.htm (They even provide free design tools for download for people interested in digital design)

For a sample of their potential power, for ~$30 you can get a part that has dedicated multipliers and adders that can perform (integer) multiplication and addition roughly 20 times as fast as many PCs and on top of that you have tens of thousands of small programmable digital function generators that each can perform hundreds of millions of computations per second. For some algorithms (with a lot of work and careful layout) you can calculate things 100-1000 times faster than a PC on a part that costs less (realistically though you're more likely to see 2-10 times faster execution possible for maybe half the software typically written for a PC, but the trick is to not write a typical application for them so that you can avoid the inefficiencies and really make use of the power). If someone could write a very general purpose compiler from a language like C to an FPGA, it would be revolutionary for the computing industry (you can see lots of attempts but they either require a lot of software rewriting or they only work with small and limited programs).

On the other hand, if compilers become good enough at translating programs into large parallel processes, then this transformation in hardware architectures could end up going largely unnoticed by programmers (as the compilers and hardware do the translation in the background without needing programmers to alter their algorithms), though you'd still expect a program intentionally designed to be parallel would have an advantage in this process.

Multithreading is a great intermediate solution. A programmer doesn't have to write specifically for the hardware but just concentrates on breaking things up into parallel processes and then lets the compiler do the mapping to hardware. Multithreading is pretty easy for programmers, but still not optimal for the hardware: it's halfway there.
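
As a minimal sketch of "breaking things up into parallel processes", here is a chunked sum of squares using Python's standard `concurrent.futures` thread pool. The function names and the chunking scheme are illustrative assumptions, not anything from the article. Note that in standard CPython, threads do not speed up pure-Python arithmetic like this; the point is only the structure (split, run independently, combine).

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    """Work done by one thread: needs no data from the other chunks."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(values, workers=4):
    """Split the input into roughly equal chunks, process them in a
    thread pool, then combine the partial results."""
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))

data = list(range(1000))
result = parallel_sum_of_squares(data)
print(result)
```

The programmer only decides where the work can be split; how the chunks are mapped onto cores is left to the runtime and the operating system.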
yor_on
Yeah i thought it was something like that :)
But what about the software to synchronize them, to get the 'timing' sequence right, etc.?
Probably it would be easiest to have a conventional computer to create the overriding software? Or is there a better way. If you wanted a convertible, kind of.
yor_on
BTW, I've been thinking of buying a PS3 and running a Linux distribution on it. Check the specs. It seems awfully nice, doesn't it :) The only thing stopping me is that you don't seem to be able to upgrade the memory. Maybe you can work around it using CompactFlash. I don't know?

CPU: Cell Processor

* PowerPC-base Core @3.2GHz
* 1 VMX vector unit per core
* 512KB L2 cache
* 7 x SPE @3.2GHz
* 7 x 128b 128 SIMD GPRs
* 7 x 256KB SRAM for SPE
* Note: 1 of 8 SPEs reserved for redundancy
* Total floating point performance: 218 GFLOPS

GPU: RSX @550MHz
* 1.8 TFLOPS floating point performance
* Full HD (up to 1080p) x 2 channels
* Multi-way programmable parallel floating point shader pipelines

Sound: Dolby 5.1ch, DTS, LPCM, etc. (Cell-base processing)

Memory:
* 256MB XDR Main RAM @3.2GHz
* 256MB GDDR3 VRAM @700MHz

System Bandwidth:
* Main RAM: 25.6GB/s
* VRAM: 22.4GB/s
* RSX: 20GB/s (write) + 15GB/s (read)
* SB: 2.5GB/s (write) + 2.5GB/s (read)

System Floating Point Performance: 2 TFLOPS

Storage:
* HDD
* Detachable 2.5” HDD slot x 1

I/O:
* USB: Front x 4, Rear x 2 (USB2.0)
* Memory Stick: standard/Duo, PRO x 1
* SD: standard/mini x 1
* CompactFlash: (Type I, II) x 1

Communication: Ethernet (10BASE-T, 100BASE-TX, 1000BASE-T) x3 (input x 1 + output x 2)

Wi-Fi: IEEE 802.11 b/g
Bluetooth: Bluetooth 2.0 (EDR)

Controller:
* Bluetooth (up to 7)
* USB2.0 (wired)
* Wi-Fi (PSP®)
* Network (over IP)

AV Output:
* Screen size: 480i, 480p, 720p, 1080i, 1080p
* HDMI: HDMI out x 2
* Analog: AV MULTI OUT x 1
* Digital audio: DIGITAL OUT (OPTICAL) x 1

CD Disc media (read only):
* PlayStation CD-ROM
* PlayStation 2 CD-ROM
* CD-DA (ROM), CD-R, CD-RW
* SACD Hybrid (CD layer), SACD HD
* DualDisc (audio side), DualDisc (DVD side)

DVD Disc media (read only):
* PlayStation 2 DVD-ROM
* PLAYSTATION 3 DVD-ROM
* DVD-Video: DVD-ROM, DVD-R, DVD-RW, DVD+R, DVD+RW

Blu-ray Disc media (read only):
* PLAYSTATION 3 BD-ROM
* BD-Video: BD-ROM, BD-R, BD-RE
StevenA
QUOTE (yor_on+Jul 13 2007, 06:42 AM)
BTW, I've been thinking of buying a PS3 and running a Linux distribution on it. Check the specs. It seems awfully nice, doesn't it :) The only thing stopping me is that you don't seem to be able to upgrade the memory. Maybe you can work around it using CompactFlash. I don't know?


I like the way you think. That's not a bad idea (and I bet you'd never have to worry about getting a virus either :D)
El_Machinae
Okay, so if I have a teenager who's thinking about being a programmer (and is bright enough to realise that parallel programming is the future) then he's going to want to start exposing himself to certain flavours of mathematics. This would be to start encouraging a certain type of mathematical thinking which would be conducive to being able to design parallel programs.

I wonder what mathematical flavours would be helpful seeds for getting the most useful type of thinking?
StevenA
QUOTE (El_Machinae+Jul 13 2007, 01:04 PM)
Okay, so if I have a teenager who's thinking about being a programmer (and is bright enough to realise that parallel programming is the future) then he's going to want to start exposing himself to certain flavours of mathematics. This would be to start encouraging a certain type of mathematical thinking which would be conducive to being able to design parallel programs.

I wonder what mathematical flavours would be helpful seeds for getting the most useful type of thinking?


That's a good question. Well, to take full advantage of digital electronics, a good background in digital arithmetic and logic is very useful (for example, here's the structure of a typical, rather general purpose computational unit in terms of binary logic: http://en.wikipedia.org/wiki/Arithmetic_logic_unit). Not that we'll necessarily always be working with binary: it isn't likely to be supplanted soon, but taking advantage of the analog characteristics of transistors, for example, can give large boosts in computational power for some applications. For an easy way to design such circuits without spending tens of thousands of dollars on a custom integrated circuit, there's programmable logic (http://en.wikipedia.org/wiki/Programmable_logic), of which CPLDs and FPGAs are the current best options.
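
To give a flavour of the digital arithmetic and logic mentioned above, here is a small Python model of a ripple-carry adder built only from AND, OR and XOR on single bits. It is a sketch of the textbook construction, not of any particular ALU or FPGA design, and the function names are made up for illustration.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder expressed with XOR, AND and OR gates."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out

def ripple_carry_add(x, y, width=8):
    """Add two unsigned integers bit by bit, passing the carry along
    the chain. Returns (sum truncated to `width` bits, final carry)."""
    carry = 0
    result = 0
    for bit in range(width):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << bit
    return result, carry

print(ripple_carry_add(100, 55))    # (155, 0)
print(ripple_carry_add(200, 100))   # (44, 1): 300 overflows 8 bits
```

In hardware all the full adders exist side by side; the loop here only stands in for the carry rippling from one stage to the next.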

In terms of high performance computation, a good buzzword is "Digital Signal Processing" or DSP (http://en.wikipedia.org/wiki/Digital_signal_processing). To make these algorithms efficient in hardware, which is what this thread is about, parallel processing techniques can be applied in various ways, such as systolic arrays http://en.wikipedia.org/wiki/Systolic_array (though ultimately we're likely to see 3-dimensional arrays, for now hardware technology is generally restricted to 2-dimensional arrays). A few applications that are both very general purpose and amenable to almost all of these are 1) artificial neural networks, 2) genetic, stochastic and Boltzmann algorithms, and 3) matrix multiplication. It's very interesting (though not unexpected) that these closely match 'natural' forms of computation.

Here's a link to a Google search for "FPGA systolic pipelined algorithms": http://www.google.com/search?hl=en&q=FPGA+...G=Google+Search. You basically see a ton of whitepapers on research in these areas, as they tend to be the most efficient forms of computation with current silicon technology.
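
To make the systolic array idea concrete, here is a small software simulation of matrix multiplication on a 2-dimensional grid of cells. It is a sketch under assumptions: an "output-stationary" layout where cell (i, j) accumulates C[i][j], rows of A enter from the left and columns of B from the top, each delayed by one step per row or column. The function name is made up, and real hardware would run all cells of a step at the same time rather than in nested loops.

```python
def systolic_matmul(A, B):
    """Simulate C = A x B on an n-by-m grid of multiply-accumulate cells.
    Each step, every cell takes the A value from its left neighbour and
    the B value from the neighbour above, multiplies them, and adds the
    product to its own running total."""
    n, inner, m = len(A), len(B), len(B[0])
    a_reg = [[0] * m for _ in range(n)]   # A values currently in each cell
    b_reg = [[0] * m for _ in range(n)]   # B values currently in each cell
    C = [[0] * m for _ in range(n)]
    for t in range(n + m + inner - 2):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:   # left edge: feed row i of A, delayed by i steps
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < inner else 0
                else:        # interior: value shifts in from the left
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:   # top edge: feed column j of B, delayed by j steps
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < inner else 0
                else:        # interior: value shifts in from above
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each cell only ever talks to its immediate neighbours, which is what makes this kind of layout attractive on a 2-dimensional chip.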

A good subject that's not often recognized as related is Computational Geometry http://www.google.com/search?hl=en&q=FPGA+...G=Google+Search. It's not directly related but computing naturally occurs within some physical environment and algorithms that specifically work with geometric structures are often easily converted into large arrays of computing elements.

Another very interesting possibility is computation performed on even finer scales using Cellular Automata http://www.google.com/search?hl=en&q=Cellu...ata&btnG=Search. These are just vast seas of very small and identical computing elements, each storing a specific state and communicating only with nearby elements ... blazing fast in terms of computations per second, but limited in what algorithms they can easily compute. Part of the trick is in finding ways of mapping typical algorithms into something compatible with such an architecture. The nice thing with programmable logic is that you can implement quite a variety of architectures and make tradeoffs where desired. There isn't any specific architecture that's best for all forms of computation, and part of the challenge is finding which physical structures operate best for different forms of computation; usually they fall into a few classes of architectures.
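
A one-dimensional cellular automaton is easy to try in software. The sketch below assumes a ring of cells and the standard numbering of "elementary" rules (rule 90 is used as the default only because it is simple: each cell becomes the XOR of its two neighbours). The function name is made up for illustration.

```python
def step(cells, rule=90):
    """Advance a ring of 0/1 cells by one generation. Each cell looks
    only at itself and its two neighbours; the three bits form an index
    (0 to 7) into the binary digits of the rule number."""
    n = len(cells)
    return [
        (rule >> ((cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

cells = [0] * 15
cells[7] = 1                      # start from a single live cell
for _ in range(6):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)
```

Every cell applies the same tiny rule and needs only local information, so in hardware all cells can update in the same clock cycle.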
yor_on
How old would that teenager be? If he wants to build things in software just to test his ideas :) I think there are programs for that on the net. At least they exist for sound devices in Linux. And they are for 'real', even though you create them in software. I have a feeling that someone :) on this forum would be able to point you to a 'proper' program.

Insert a she instead of a he if that's the case.
El_Machinae
Well, there aren't that many types of parallel processors to practice on currently. I think that it will require an entirely different style of thinking than the one we're currently familiar with in programming. Just like calculus helps early physics make sense, and matrices help statistics make sense ... I'm sure there are mathematical fields which will be useful, because they encourage a different style of thinking.
Zephir
Parallel computing is more widespread than you might imagine. The computation of shader effects in computer games is mostly a parallel computing application. You can check NVIDIA's CUDA project, or the smaller Microsoft Research Accelerator project, for a practical implementation of parallel operations on large 2D arrays, which uses the parallel GPU units of modern graphics cards for this purpose. Below is a generic example of a shader solution of the Laplace equation in VB.NET. The parallel part is the shader program passed as a string to E.FromString.

CODE
Imports System, System.Drawing, System.Windows.Forms, Microsoft.DirectX, Microsoft.DirectX.Direct3D
Class FMain: Inherits Form
 Shared F As FMain, D As Device, E As Effect, S As Sprite, T As Texture, N% = 256
 Shared Sub Main()
   F = New FMain: F.ClientSize = New Size(N, N): F.Show()
   Dim PP As New PresentParameters: PP.Windowed = 1: PP.SwapEffect = 1: PP.PresentationInterval = PresentInterval.Immediate
   D = New Device(0, 1, F.Handle, 32, PP): S = New Sprite(D)
   E = E.FromString(D, _
   "float dx, dy; sampler2D S;" & _
   "float4 Blur(float2 T: TEXCOORD0): COLOR {" & _
   "  float4 L = tex2D(S, float2(T.x - dx, T.y));" & _
   "  float4 R = tex2D(S, float2(T.x + dx, T.y));" & _
   "  float4 B = tex2D(S, float2(T.x, T.y - dy));" & _
   "  float4 N = tex2D(S, float2(T.x, T.y + dy));" & _
   "  return ((L + R + B + N) / 4.0f);" & _
   "}" & _
   "technique Simulace {" & _
   "  pass P0 {" & _
   "    PixelShader = compile ps_2_0 Blur();" & _
   "}}", Nothing, 0, Nothing)
   E.Technique = E.GetTechnique("Simulace")
   E.SetValue(EffectHandle.FromString("dx"), 1! / N): E.SetValue(EffectHandle.FromString("dy"), 1! / N)
   T = New Texture(D, N, N, 1, 1, Format.A16B16G16R16F, 0)
   F.Render2Texture(New Texture(D, New Bitmap(N, N), 0, 1), T)
   E.Begin(0)
   While F.Created
     F.Render2Texture(T, T, 0)
     D.BeginScene
     S.Begin(0): S.Draw2D(T, Nothing, 0, New Point(0, 0), Color.White): S.End
     D.EndScene
     D.Present: Application.DoEvents
   End While
   E.End: S.Dispose: D.Dispose
 End Sub
 Sub Render2Texture(tSrc As Texture, tDst As Texture, Optional iPass% = -1)
   Dim sOld As Surface = D.GetRenderTarget(0)
   D.SetRenderTarget(0, tDst.GetSurfaceLevel(0))
   D.BeginScene(): S.Begin(0)
   If iPass >= 0 Then E.BeginPass(iPass)
   S.Draw2D(tSrc, Nothing, 0, New Point(0, 0), Color.White)
   If iPass >= 0 Then E.EndPass
   S.End: D.EndScene
   D.SetRenderTarget(0, sOld)
 End Sub
End Class

The principle of shader parallel computing is simple: a rectangular area of pixels (i.e. a 2D bitmap) is rendered by the GPU, not to the screen but to an off-screen buffer. During rendering, the shader microprogram is applied to the color of each pixel. After rendering, the addresses of the off-screen and on-screen buffers are switched, so that the shader microcode is applied to the resulting texture, repeatedly. Using the Accelerator/CUDA wrappers simplifies parallel computing significantly. Below is an example of using the MS Accelerator PCL:
CODE

FPA fpa = new FPA(3.0f, new int[] {2, 3});        // Creates a 2x3 constant array
using (DFPA a = new DFPA(1.0f, 2.0f, 3.0f, 4.0f, 5.0f))        {
    float[] result;
    FPA scale = FPA.MaxVal(FPA.Abs(a));
    scale = FPA.Cmp(scale, scale, 1.0f);
    FPA s = a / scale.Replicate(a.Shape);
    s = scale * FPA.Sqrt(FPA.Sum(s * s));
    s.ToArray(out result);
}

The power of GPU computing is significant: it is often 20 to 50 times faster than C/C++ CPU code on a common computer. Thanks to recent enhancements of shader programming languages, shader routines can be written in human-readable syntax, similar to the C language. In the form of a graphics accelerator, nearly everybody can have a single-chip scalar supercomputer array in their kitchen, ready for supercomputing experiments! You can check, for example, the particle engine or the 3D Fluid simulation project as a practical illustration of parallel GPU computing.
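
For anyone without a DirectX setup, the render-and-swap loop of the Laplace shader can be mimicked on the CPU in plain Python. This is only a sketch of the same four-neighbour averaging with two buffers that trade places after each pass. Holding the border cells fixed is an assumption made here (the shader's edge behaviour depends on the texture sampler settings), and the function names are made up.

```python
def jacobi_step(src, dst):
    """One 'render pass': every interior cell of dst becomes the average
    of its four neighbours in src. On a GPU each cell is computed
    independently, so they can all be done in parallel."""
    rows, cols = len(src), len(src[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            dst[y][x] = (src[y][x - 1] + src[y][x + 1] +
                         src[y - 1][x] + src[y + 1][x]) / 4.0

def relax(grid, iterations):
    """Repeat the pass, swapping the two buffers each time (ping-pong).
    Border cells are never written, so they keep their starting values."""
    src = [row[:] for row in grid]
    dst = [row[:] for row in grid]
    for _ in range(iterations):
        jacobi_step(src, dst)
        src, dst = dst, src
    return src

# Hot top edge, cold everywhere else: the interior slowly warms up.
grid = [[1.0] * 6] + [[0.0] * 6 for _ in range(5)]
for row in relax(grid, 50):
    print(" ".join(f"{v:.2f}" for v in row))
```

The two-buffer swap matters: reading and writing the same buffer within a pass would mix old and new values, which is why the GPU version renders to an off-screen target.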
yor_on
Cool Zephir, i just knew there was someone, this is a good forum.