Hardly a day goes by without the publication of some study on male-female differences based on a psychology, medicine, or neuroscience research paper. In some cases, Mark Liberman will then go and write a post explaining that the differences found in the study in question may be significant, but are not substantial - the differences in means are small and there's lots of overlap between the distributions (e.g.). But that's beside the point.
To get to the heart of why that is so, let's review the most basic concepts of sampling theory. Your aim in a study may be to find out something about a certain population, such as contemporary U.S. citizens. This - the totality of objects your claim will be about - is called the universe. Because it is typically not feasible to look at all of the universe, you draw a sample from this universe. Surprisingly enough, relatively small samples allow you to make statements about the universe if the sample is representative of that universe. (This is shown in introductory stats texts.)
For example, you may want to know how U.S. voters are going to vote in the upcoming elections. You draw a representative sample of eligible voters and try to measure how they're going to vote by asking them. (The asking method has its own problems, but that's not the point here.)
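To make the claim about small samples concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the universe of 200,000 voters, the 52% share for a hypothetical candidate A, and the helper name `estimate_share`. It draws simple random samples of increasing size and compares the estimated share with the true one.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical universe: 200,000 eligible voters, 104,000 of whom favor candidate A.
UNIVERSE_SIZE = 200_000
N_FOR_A = 104_000
TRUE_SHARE = N_FOR_A / UNIVERSE_SIZE  # 0.52

# 1 = intends to vote for A, 0 = intends to vote for someone else.
universe = [1] * N_FOR_A + [0] * (UNIVERSE_SIZE - N_FOR_A)


def estimate_share(sample_size):
    """Estimate A's share from a simple random sample drawn without replacement."""
    sample = random.sample(universe, sample_size)
    return sum(sample) / sample_size


# Estimates for a few sample sizes; larger samples should scatter less.
estimates = {n: estimate_share(n) for n in (100, 1_000, 10_000)}

for n, est in estimates.items():
    print(f"n = {n:>6}: estimated share = {est:.3f} (true share = {TRUE_SHARE:.3f})")
```

Even the 1,000-voter sample, half a percent of this toy universe, should land within a few percentage points of the true share. That only works because every voter had the same chance of ending up in the sample.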
Let's apply this logic to studies of male-female differences. Right off the bat, there's a problem: What's the universe? Answering "males and females" doesn't help much: Are we thinking about all males and females that ever lived and will ever live? All currently living? All currently living in the U.S.? All currently living in the Boston Metropolitan Statistical Area?
In most studies of male-female differences there is no discussion of this question, perhaps because it might lead into the second question: Is your sample representative of that population? In most cases, researchers would be hard-pressed to argue that their sample is anywhere near representative of a meaningfully defined and substantially interesting universe. Try getting a representative sample into a psychology lab or a brain scanner. Good luck.
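As a result of not using samples representative of males and females (in the current U.S., say), the measured differences are not meaningful. That is, we have no idea how they relate to the differences in the universe that we're actually interested in. And if you think that significance tests help, you need to read up on what significance tests do: they gauge how much of a result could be due to chance in random sampling from the universe, and they say nothing about the bias that comes from a sample that was not drawn that way.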
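Here is a rough sketch of how an unrepresentative sample can manufacture a difference. All of it is hypothetical: a universe in which some trait has exactly the same distribution for men and women, and a made-up recruitment quirk (the `volunteers` helper) by which only men scoring above 100 show up at the lab. It is not a model of any actual study.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical universe: the trait is distributed identically in both groups.
men = [random.gauss(100, 15) for _ in range(50_000)]
women = [random.gauss(100, 15) for _ in range(50_000)]


def mean_diff(a, b):
    """Difference in means, a minus b."""
    return statistics.mean(a) - statistics.mean(b)


def volunteers(group, threshold, k):
    """Assumed recruitment quirk: only people scoring above `threshold` volunteer."""
    pool = [x for x in group if x > threshold]
    return random.sample(pool, k)


# True difference in the universe (close to zero by construction).
universe_diff = mean_diff(men, women)

# Representative samples: simple random samples of 500 from each group.
rep_diff = mean_diff(random.sample(men, 500), random.sample(women, 500))

# Convenience sample: self-selected men, randomly sampled women.
conv_diff = mean_diff(volunteers(men, 100, 500), random.sample(women, 500))

print(f"difference in the universe:        {universe_diff:6.2f}")
print(f"difference, representative sample: {rep_diff:6.2f}")
print(f"difference, convenience sample:    {conv_diff:6.2f}")
```

The convenience sample should show a sizable gap even though there is none in the universe, and with 500 people per group a significance test would happily flag it. The gap describes who walked into the lab, not men and women.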
Some studies do use representative samples of well-defined universes. These can yield valid information. If some information you come across is based on a study done in a psychology or neuroscience lab, you may automatically assume that study is not of the valid type.